This folder contains a simple cleaning script for WebVTT-style transcripts of qualitative interviews.
cleaning.py will:
- parse transcript files from data/raw
- remove WebVTT timestamps and numbered cue lines
- merge consecutive lines from the same speaker
- normalize common misspellings specified in words_to_correct.txt
- remove filler phrases such as "you know", "like", and "so"
- collapse repeated ellipsis patterns like x... x..., x... x, and x x...
- write cleaned output to data/processed
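The timestamp removal and speaker merging steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in cleaning.py: the function name `strip_cues_and_merge`, the regular expressions, and the assumed `Name: text` speaker format are all hypothetical.

```python
import re

# Matches WebVTT timing lines such as "00:00:01.000 --> 00:00:04.000".
TIMESTAMP_RE = re.compile(
    r"^\s*(?:\d{1,2}:)?\d{2}:\d{2}[.,]\d{3}\s*-->\s*"
    r"(?:\d{1,2}:)?\d{2}:\d{2}[.,]\d{3}.*$"
)
# Matches lines that contain only a cue number.
CUE_NUMBER_RE = re.compile(r"^\s*\d+\s*$")
# Assumed speaker format (hypothetical): "Name: text".
SPEAKER_RE = re.compile(r"^([^:]{1,60}):\s*(.*)$")


def strip_cues_and_merge(lines):
    """Drop timing and cue lines, then merge consecutive turns by one speaker."""
    turns = []  # each entry is [speaker, text]
    for raw in lines:
        line = raw.strip()
        if not line or line == "WEBVTT":
            continue
        if TIMESTAMP_RE.match(line) or CUE_NUMBER_RE.match(line):
            continue
        match = SPEAKER_RE.match(line)
        if match:
            speaker, text = match.group(1).strip(), match.group(2).strip()
        else:
            # No speaker label: treat the line as a continuation of the last turn.
            speaker, text = (turns[-1][0] if turns else ""), line
        if turns and turns[-1][0] == speaker:
            turns[-1][1] = (turns[-1][1] + " " + text).strip()
        else:
            turns.append([speaker, text])
    return [f"{s}: {t}" if s else t for s, t in turns]
```

One limitation of this sketch: a line whose text contains a colon but no speaker label (a clock time, for instance) would be misread as a new speaker, so a real implementation would likely need a stricter speaker pattern.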
Spelling corrections are configured in words_to_correct.txt. Each line should contain a misspelling and its correction in the format:
misspelling -> correction
For example:
mavim -> MAVIM
ifad -> IFAD
navte jaswini -> Nav Tejaswini
The script will automatically read these corrections and apply them during processing. Add, remove, or modify entries in this file to customize the spelling corrections for your transcripts.
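As a rough sketch of how a file in this format could be read and applied (the helper names `parse_corrections` and `apply_corrections` are hypothetical, and whole-word, case-insensitive matching is an assumption rather than the script's confirmed behavior):

```python
import re


def parse_corrections(text):
    """Parse 'misspelling -> correction' lines into a dict."""
    corrections = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank or malformed lines
        wrong, right = line.split("->", 1)
        wrong, right = wrong.strip(), right.strip()
        if wrong:
            corrections[wrong] = right
    return corrections


def apply_corrections(text, corrections):
    """Replace each misspelling as a whole word or phrase, ignoring case."""
    # Longer keys first so multi-word phrases win over their parts.
    for wrong in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash handling in the correction.
        text = pattern.sub(lambda m, r=corrections[wrong]: r, text)
    return text
```

With the three example entries above, `apply_corrections("We met mavim and ifad staff.", rules)` would yield the text with `MAVIM` and `IFAD` substituted.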
- Python 3.8+
From the repository root:
cd "...\CleanWEBVTTInterview"
python cleaning.py
This will process all .txt files in data/raw and save cleaned versions to data/processed with the _cleaned.txt suffix.
To clean a single file manually, update the script call in the if __name__ == "__main__" section or import and call:
from cleaning import clean_transcript
clean_transcript("./data/raw/IFAD Liaison.txt")
Cleaned transcripts are saved to the data/processed folder with names like:
[file]_cleaned.txt
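The naming scheme can be illustrated with a small sketch. The helpers `cleaned_path` and `raw_transcripts` are hypothetical and not part of cleaning.py; the sketch assumes the output name is the input file's stem followed by `_cleaned.txt`.

```python
from pathlib import Path


def cleaned_path(raw_file, processed_dir="data/processed"):
    """Return the assumed output path for a raw transcript."""
    raw_file = Path(raw_file)
    # Assumption: the .txt extension is dropped before adding the suffix.
    return Path(processed_dir) / f"{raw_file.stem}_cleaned.txt"


def raw_transcripts(raw_dir="data/raw"):
    """List the .txt files in the raw folder in a stable order."""
    return sorted(Path(raw_dir).glob("*.txt"))
```

Under that assumption, `./data/raw/IFAD Liaison.txt` would map to `data/processed/IFAD Liaison_cleaned.txt`.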
- The cleaned output still needs further processing to separate interviewer and interviewee text.
- The script is designed for .txt files exported in a WebVTT-like format.
- If you need to add new filler words, update remove_filler_words() in cleaning.py.
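For orientation, a filler-removal step of this kind might look like the sketch below. It is not the actual body of remove_filler_words(); the function name `remove_fillers`, the `FILLERS` list, and the whole-word, case-insensitive matching are assumptions made for illustration.

```python
import re

# Hypothetical list; extend it to drop more phrases.
FILLERS = ["you know", "like", "so"]


def remove_fillers(text, fillers=FILLERS):
    """Remove filler phrases as whole words, then tidy leftover spacing."""
    # Longer phrases first so "you know" is handled before shorter fillers.
    for phrase in sorted(fillers, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b,?"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+([,.!?])", r"\1", text)  # no space before punctuation
    text = re.sub(r"\s{2,}", " ", text)         # collapse runs of whitespace
    return text.strip()
```

A blunt approach like this also strips meaningful uses of words such as "like" and "so", which is worth keeping in mind before adding common words to the list.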