This is still a WIP as of 10.08.2025
WhisperX Transcription for Notetaking maniacs and Planners.
After cloning the repo and setting up the env, run `pip install .` to install the `wxt` command.
For the sample audio placed in assets/sample:
wxt assets/sample/audio.mp3
For other supported options see wxt --help.
- Summarization model: Any available on Ollama (developed with gemma3:4b)
- Transcription model: WhisperX Large v3
- Diarization model:
In retrieval mode, based on MTEB ranking:
- Text embedding model: Qwen3-Embedding-0.6B
Cluster backend at work does not support float16 computation (Quadro P4000)
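A minimal sketch of how the float16 limitation could be handled when choosing a compute type. The function name `pick_compute_type` is hypothetical and not part of this repo. The 7.0 threshold is an assumption based on CTranslate2's documented float16 requirement. The Quadro P4000 is a Pascal card with compute capability 6.1, so it falls back to float32 here.

```python
def pick_compute_type(capability: tuple[int, int]) -> str:
    """Return a compute type for a CUDA compute capability given as (major, minor)."""
    # Assumption: float16 needs compute capability >= 7.0 (Volta or newer).
    # Older cards such as the Quadro P4000 (6.1) fall back to float32.
    return "float16" if capability >= (7, 0) else "float32"


if __name__ == "__main__":
    # Quadro P4000 reports compute capability (6, 1).
    print(pick_compute_type((6, 1)))
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device, which could be passed in directly. Whether `wxt` exposes a compute type option is not confirmed here; see `wxt --help`.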
Python 3.10.
Install ffmpeg, rust, cudnn=8.9.7 (faster-whisper-large-v3 looks for `libcudnn_ops_infer.so.8`).
Set up Ollama following the latest instructions from https://github.com/ollama/ollama/tree/main/docs
See pyproject.toml for python dependencies.
Accept terms for
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://huggingface.co/pyannote/segmentation-3.0
Add huggingface token to .env file:
MY_TOKEN=hf_xxx
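For illustration, the token could be read from the .env file with the standard library alone. This is a sketch, not necessarily how the project loads it (it may use a dotenv package instead); `parse_env` and `load_token` are hypothetical helper names.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_token(path: str = ".env", key: str = "MY_TOKEN") -> str:
    """Return the token stored under `key`, or raise KeyError if it is missing."""
    values = parse_env(Path(path).read_text(encoding="utf-8"))
    if not values.get(key):
        raise KeyError(f"{key} not found in {path}")
    return values[key]
```

Calling `load_token()` from the repo root would return the value after `MY_TOKEN=`, assuming the .env file sits there.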
Replicate provides inference. See colab.
Freemium (referral): Otter.ai
To extract audio (mp3) from a video file, you can use the following command:
ffmpeg -i input_video.mp4 -f mp3 -vn -ar 44100 output_audio.mp3
`-vn` disables video recording, `-ar` sets the audio sample rate, and `-f mp3` specifies the output format as mp3.
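The same ffmpeg call can be wrapped in Python when several videos need converting. This is a sketch under the assumption that `ffmpeg` is on the PATH; `build_ffmpeg_cmd` and `extract_audio` are hypothetical names, not part of the `wxt` command.

```python
import subprocess


def build_ffmpeg_cmd(video_path: str, audio_path: str, sample_rate: int = 44100) -> list[str]:
    """Build the argument list for the ffmpeg command shown above."""
    return [
        "ffmpeg",
        "-i", video_path,         # input video
        "-f", "mp3",              # output format
        "-vn",                    # drop the video stream
        "-ar", str(sample_rate),  # audio sample rate in Hz
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str, sample_rate: int = 44100) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path, sample_rate), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.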
- ctranslate2: `ImportError: libctranslate2-d3638643.so.4.4.0: cannot enable executable stack as shared object requires: Invalid argument` shared object error. Fixed this issue with this on Manjaro-xfce.
- If the `huggingface_hub` download takes longer, I found it easier to just clone the repo, for example with their CLI: `hf download Systran/faster-whisper-large-v3`, and place it in the `cache_dir`, `model/` in this case. `faster-whisper` also depends on an older `huggingface_hub` version that does not come with `xet`. The download only appears slower due to an inner loop with tqdm, which sometimes does not appear to update the outer, main progress bar, probably related to `xet`'s chunked downloads.