Compare commits
13 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | e2e19940dd | | |
| | 0293a13177 | | |
| | 8d5c8d6097 | | |
| | e29572420e | | |
| | f7d621e510 | | |
| | 2a1df6aeba | | |
| | 58255c3d10 | | |
| | ea43074852 | | |
| | 7b81778d9e | | |
| | e65462f57b | | |
| | 09e3e43c51 | | |
| | d4c26f6c37 | | |
| | acb6947f87 | | |
@@ -1,64 +1,85 @@
|
|||||||
## Local Transcribe with Whisper
|
## Local Transcribe with Whisper
|
||||||
|
|
||||||
> **⚠ Note for Mac users (Apple Silicon):** This version uses `faster-whisper` (CTranslate2), which does **not** support Apple M-chip GPU acceleration. Transcription will run on CPU, which is slower than OpenAI's Whisper with Metal/CoreML support. The trade-off is a much simpler installation — no conda, no PyTorch, no admin rights. If you'd prefer M-chip GPU acceleration and don't mind a more involved setup, switch to the **classic** release:
|
> **🍎 Apple Silicon GPU/NPU acceleration:** This version now supports native Apple GPU/NPU acceleration via [MLX Whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper). On Apple Silicon Macs, transcription runs on the Apple GPU and Neural Engine — no CPU fallback needed.
|
||||||
> ```
|
|
||||||
> git checkout classic
|
|
||||||
> ```
|
|
||||||
|
|
||||||
Local Transcribe with Whisper is a user-friendly desktop application that allows you to transcribe audio and video files using the Whisper ASR system, powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (CTranslate2). This application provides a graphical user interface (GUI) built with Python and the Tkinter library, making it easy to use even for those not familiar with programming.
|
Local Transcribe with Whisper is a user-friendly desktop application that allows you to transcribe audio and video files using the Whisper ASR system, powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (CTranslate2) on Windows/Linux and [MLX Whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper) on Apple Silicon. This application provides a graphical user interface (GUI) built with Python and the Tkinter library, making it easy to use even for those not familiar with programming.
|
||||||
|
|
||||||
|
## New in version 3.0!
|
||||||
|
1. **Apple Silicon GPU/NPU support** — native MLX backend for Apple Silicon Macs, using Apple GPU + Neural Engine.
|
||||||
|
2. **SRT subtitle export** — valid SubRip files alongside the existing TXT output, ready for HandBrake or any video player.
|
||||||
|
3. **VAD filter** — removes silence, reduces hallucination, improves accuracy.
|
||||||
|
4. **Word-level timestamps** — per-word SRT timing for precise subtitle burning.
|
||||||
|
5. **Translation mode** — transcribe any language and translate to English in one step.
|
||||||
|
6. **Stop button** — immediately cancel any transcription, including model downloads.
|
||||||
|
7. **Language dropdown** — 99 languages with proper ISO codes, no more guessing formats.
|
||||||
|
8. **Model descriptions** — speed, size, quality stars, and use case shown for every model.
|
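The VAD, word-timestamp, translation, and language options listed above match keyword arguments that faster-whisper's `WhisperModel.transcribe` accepts (`vad_filter`, `word_timestamps`, `task`, `language`). How the app connects its switches to them is not shown here, so this is only a minimal sketch. The helper name `build_transcribe_kwargs` is hypothetical, and the MLX backend is not covered.

```python
from typing import Any, Dict, Optional


def build_transcribe_kwargs(
    language: Optional[str] = None,
    vad_filter: bool = False,
    word_timestamps: bool = False,
    translate: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: turn UI toggles into faster-whisper keyword arguments."""
    return {
        # None lets Whisper auto-detect the spoken language.
        "language": language,
        # "translate" yields English text; "transcribe" keeps the source language.
        "task": "translate" if translate else "transcribe",
        "vad_filter": vad_filter,
        "word_timestamps": word_timestamps,
    }


kwargs = build_transcribe_kwargs(language="sv", vad_filter=True)
# With faster-whisper installed, the dict could be passed along like this:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("base")
#   segments, info = model.transcribe("audio.m4a", **kwargs)
```

Keeping the mapping in one small function makes it easy to check which options a run used without loading a model.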
||||||
|
|
||||||
## New in version 2.0!
|
## New in version 2.0!
|
||||||
1. **Switched to faster-whisper** — up to 4× faster transcription with lower memory usage.
|
1. **Switched to faster-whisper** — up to 4× faster transcription with lower memory usage, simpler installation.
|
||||||
2. **No separate FFmpeg installation needed** — audio decoding is handled by the bundled PyAV library.
|
2. **Swedish-optimised models** — [KB-Whisper](https://huggingface.co/collections/KBLab/kb-whisper) from the National Library of Sweden (KBLab)
|
||||||
3. **No admin rights required** — a plain `pip install` covers everything.
|
3. **No separate FFmpeg installation needed** — audio decoding is handled by the bundled PyAV library.
|
||||||
4. **No PyTorch dependency** — dramatically smaller install footprint.
|
4. **No admin rights required** — a plain `pip install` covers everything.
|
||||||
5. **`tiny` model added** — smallest and fastest option for quick drafts.
|
5. **No PyTorch dependency** — dramatically smaller install footprint.
|
||||||
|
6. **Integrated console** - all progress and log output is shown inside the application window.
|
||||||
|
7. **`tiny` model added** — smallest and fastest option.
|
||||||
|
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
* Select the folder containing the audio or video files you want to transcribe. Tested with m4a video.
|
* Select the folder containing the audio or video files you want to transcribe. Tested with m4a video.
|
||||||
* Choose the language of the files you are transcribing. You can either select a specific language or let the application automatically detect the language.
|
* Choose the language of the files you are transcribing from a dropdown of 99 supported languages, or let the application automatically detect the language.
|
||||||
* Select the Whisper model to use for the transcription. Available models include "tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en", "large-v2", and "large-v3". Models with .en ending are better if you're transcribing English, especially the base and small models.
|
* Select the Whisper model to use for the transcription. Available models include "tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en", "large-v2", and "large-v3". Models with .en ending are better if you're transcribing English, especially the base and small models.
|
||||||
* **Swedish-optimised models** — [KB-Whisper](https://huggingface.co/collections/KBLab/kb-whisper) from the National Library of Sweden (KBLab) is available in all sizes (tiny → large). These models reduce Word Error Rate by up to 47 % compared to OpenAI Whisper on Swedish speech. The language is set to Swedish automatically when a KB model is selected.
|
* **Swedish-optimised models** — [KB-Whisper](https://huggingface.co/collections/KBLab/kb-whisper) from the National Library of Sweden (KBLab) is available in all sizes (tiny → large). These models reduce Word Error Rate by up to 47 % compared to OpenAI Whisper on Swedish speech. The language is set to Swedish automatically when a KB model is selected.
|
||||||
* Enable the verbose mode to receive detailed information during the transcription process.
|
* **VAD filter** — removes silence from audio before transcription, reducing hallucination and improving accuracy.
|
||||||
* Monitor the progress of the transcription with the progress bar and terminal.
|
* **Word-level timestamps** — generates per-word timing in the SRT output for precise subtitle synchronization.
|
||||||
|
* **Translation mode** — transcribes audio in any language and translates the result to English.
|
||||||
|
* **SRT export** — valid SubRip subtitle files saved alongside TXT, ready for HandBrake or any video player.
|
||||||
|
* Monitor the progress of the transcription with the progress bar and terminal.
|
||||||
* Confirmation dialog before starting the transcription to ensure you have selected the correct folder.
|
* Confirmation dialog before starting the transcription to ensure you have selected the correct folder.
|
||||||
* View the transcribed text in a message box once the transcription is completed.
|
* View the transcribed text in a message box once the transcription is completed.
|
||||||
|
* **Stop button** — immediately cancel transcription, including model downloads.
|
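To make the SRT export bullet concrete: a SubRip file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line followed by the text and a blank line. The sketch below assumes segments arrive as `(start_seconds, end_seconds, text)` tuples. The function names `srt_timestamp` and `to_srt` are illustrative and not taken from the project.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as SubRip cues numbered from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


example = to_srt([(0.0, 2.5, "Hello there."), (2.5, 4.04, " Second line.")])
print(example)
```

With word-level timestamps enabled, the same function could be fed one tuple per word instead of one per segment.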
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
### Get the files
|
### Get the files
|
||||||
Download the zip folder and extract it to your preferred working folder.
|
Download the zip folder and extract it to your preferred working folder.
|
||||||

|

|
||||||
Or by cloning the repository with:
|
Or by cloning the repository with:
|
||||||
```
|
```
|
||||||
git clone https://github.com/soderstromkr/transcribe.git
|
git clone https://gitea.kobim.cloud/kobim/whisper-local-transcribe.git
|
||||||
```
|
```
|
||||||
### Python Version **(any platform including Mac users)**
|
### Prerequisites
|
||||||
1. Install Python 3.10 or later. You can download it from [python.org](https://www.python.org/downloads/). During installation, **check "Add Python to PATH"**. No administrator rights are needed if you install for your user only.
|
Install **Python 3.10 or later**. Some IT policies allow installing it from the Microsoft Store or the Mac equivalent, but an install from [python.org](https://www.python.org/downloads/) is preferred. During installation, **check "Add Python to PATH"**. No administrator rights are needed if you install for your user only.
|
||||||
|
|
||||||
2. Run the installer. Open a terminal (Command Prompt on Windows, Terminal on Mac/Linux) in the project folder and run:
|
### Run on Windows
|
||||||
|
Double-click `run_Windows.bat` — it will auto-install everything on first run.
|
||||||
|
|
||||||
|
### Run on Mac / Linux
|
||||||
|
Run `./run_Mac.sh` — it will auto-install everything on first run. See [Mac instructions](Mac_instructions.md) for details.
|
||||||
|
|
||||||
|
> **Note:** The first run with a given model will download it (~145 MB for base, ~1.5 GB for medium). After that, everything works offline.
|
||||||
|
|
||||||
|
### Manual installation (if the launchers don't work)
|
||||||
|
If `run_Windows.bat` or `run_Mac.sh` fails (e.g. Python isn't on PATH, or there are permission issues), open a terminal in the project folder and run these steps manually:
|
||||||
|
```
|
||||||
|
python -m venv .venv
|
||||||
|
```
|
||||||
|
Activate the virtual environment:
|
||||||
|
- **Windows:** `.venv\Scripts\activate`
|
||||||
|
- **Mac / Linux:** `source .venv/bin/activate`
|
||||||
|
|
||||||
|
Then install and run:
|
||||||
```
|
```
|
||||||
python install.py
|
python install.py
|
||||||
```
|
|
||||||
This will:
|
|
||||||
- Install all required packages (including bundled FFmpeg — no separate install needed)
|
|
||||||
- **Auto-detect your NVIDIA GPU** and ask if you want GPU acceleration
|
|
||||||
- No conda, no admin rights required
|
|
||||||
|
|
||||||
Alternatively, you can install manually with `pip install -r requirements.txt`.
|
|
||||||
|
|
||||||
3. Run the app:
|
|
||||||
1. For **Windows**: double-click `run_Windows.bat` (it will auto-install on first run) or run:
|
|
||||||
```
|
|
||||||
python app.py
|
python app.py
|
||||||
```
|
```
|
||||||
2. For **Mac / Linux**: run `./run_Mac.sh` (auto-installs on first run). See [Mac instructions](Mac_instructions.md) for details.
|
|
||||||
|
|
||||||
**Note** The first run with a given model will download it (~75 MB for base, ~500 MB for medium). After that, everything works offline.
|
|
||||||
|
|
||||||
## GPU Support
|
## GPU Support
|
||||||
|
### Apple Silicon
|
||||||
|
On Macs with Apple Silicon, the app automatically uses the **MLX backend**, which runs inference on the Apple GPU and Neural Engine. No additional setup is needed — just install and run. MLX models are downloaded from HuggingFace on first use.
|
||||||
|
|
||||||
|
### NVIDIA GPUs
|
||||||
This program **does support running on NVIDIA GPUs**, which can significantly speed up transcription times. faster-whisper uses CTranslate2, which requires NVIDIA CUDA libraries for GPU acceleration.
|
This program **does support running on NVIDIA GPUs**, which can significantly speed up transcription times. faster-whisper uses CTranslate2, which requires NVIDIA CUDA libraries for GPU acceleration.
|
||||||
|
|
||||||
### Automatic Detection
|
#### Automatic Detection
|
||||||
The `install.py` script **automatically detects NVIDIA GPUs** and will ask if you want to install GPU support. If you skipped it during installation, you can add it anytime:
|
The `install.py` script **automatically detects NVIDIA GPUs** and will ask if you want to install GPU support. If you skipped it during installation, you can add it anytime:
|
||||||
```
|
```
|
||||||
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
|
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
|
||||||
@@ -66,7 +87,7 @@ pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
|
|||||||
|
|
||||||
**Note:** Make sure your NVIDIA GPU drivers are up to date. You can check by running `nvidia-smi` in your terminal. The program will automatically detect and use your GPU if available, otherwise it falls back to CPU.
|
**Note:** Make sure your NVIDIA GPU drivers are up to date. You can check by running `nvidia-smi` in your terminal. The program will automatically detect and use your GPU if available, otherwise it falls back to CPU.
|
||||||
|
|
||||||
### Verifying GPU Support
|
#### Verifying GPU Support
|
||||||
After installation, you can verify that your GPU is available by running:
|
After installation, you can verify that your GPU is available by running:
|
||||||
```python
|
```python
|
||||||
import ctranslate2
|
import ctranslate2
|
||||||
@@ -75,14 +96,16 @@ print(ctranslate2.get_supported_compute_types("cuda"))
|
|||||||
If this returns a list containing `"float16"`, GPU acceleration is working.
|
If this returns a list containing `"float16"`, GPU acceleration is working.
|
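The automatic GPU detection with CPU fallback described in this section could be approximated as follows. This is a sketch under assumptions, not the app's actual `detect_backend` logic: the function name `pick_device` is hypothetical, and only the CTranslate2 calls `get_cuda_device_count` and `get_supported_compute_types` are relied on.

```python
def pick_device():
    """Return (device, compute_type), preferring CUDA and falling back to CPU."""
    try:
        import ctranslate2  # installed as a dependency of faster-whisper
    except ImportError:
        return "cpu", "int8"

    try:
        if ctranslate2.get_cuda_device_count() > 0:
            supported = ctranslate2.get_supported_compute_types("cuda")
            compute_type = "float16" if "float16" in supported else "default"
            return "cuda", compute_type
    except Exception:
        # Missing or mismatched CUDA libraries: treat this as "no GPU".
        pass
    return "cpu", "int8"


device, compute_type = pick_device()
print(f"Using {device} ({compute_type})")
```

On a machine without an NVIDIA GPU, or without the CUDA libraries installed, this reports the CPU path instead of raising an error.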
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
1. Launch the app — the built-in console panel at the bottom shows a welcome message and all progress updates.
|
1. Launch the app — the built-in console panel at the bottom shows a welcome message and all progress updates. The backend indicator at the bottom shows which inference engine is active (MLX · Apple GPU/NPU, CUDA · GPU, or CPU · int8).
|
||||||
2. Select the folder containing the audio or video files you want to transcribe by clicking the "Browse" button next to the "Folder" label. This will open a file dialog where you can navigate to the desired folder. Remember, you won't be choosing individual files but whole folders!
|
2. Select the folder containing the audio or video files you want to transcribe by clicking the "Browse" button next to the "Folder" label. This will open a file dialog where you can navigate to the desired folder. Remember, you won't be choosing individual files but whole folders!
|
||||||
3. Enter the desired language for the transcription in the "Language" field. You can either select a language or leave it blank to enable automatic language detection.
|
3. Select the language from the dropdown — 99 languages are available, or leave it on "Auto-detect". For English-only models (.en) the language is locked to English; for KB Swedish models it's locked to Swedish.
|
||||||
4. Choose the Whisper model to use for the transcription from the dropdown list next to the "Model" label.
|
4. Choose the Whisper model to use for the transcription from the dropdown list next to the "Model" label. A description below shows speed, size, quality stars, and recommended use case for each model.
|
||||||
5. Click the "Transcribe" button to start the transcription. The button will be disabled during the process to prevent multiple transcriptions at once.
|
5. Toggle advanced options if needed: **VAD filter**, **Word-level timestamps**, or **Translate to English**.
|
||||||
6. Monitor progress in the embedded console panel — it shows model loading, per-file progress, and segment timestamps in real time.
|
6. Click the "Transcribe" button to start the transcription. Use the "Stop" button to cancel at any time.
|
||||||
7. Once the transcription is completed, a message box will appear displaying the result. Click "OK" to close it.
|
7. Monitor progress in the embedded console panel — it shows model loading, per-file progress, and segment timestamps in real time.
|
||||||
8. You can run the application again or quit at any time by clicking the "Quit" button.
|
8. Once the transcription is completed, a message box will appear displaying the result. Click "OK" to close it.
|
||||||
|
9. Transcriptions are saved as both `.txt` (human-readable) and `.srt` (SubRip subtitles) in the `transcriptions/` folder within the selected directory.
|
||||||
|
10. You can run the application again or quit at any time by clicking the "Quit" button.
|
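The Stop button in step 6 relies on the transcription running in a separate process that can be ended from the GUI. A standalone sketch of that terminate-then-kill pattern is shown below. The name `stop_worker` and the `grace` parameter are illustrative, and `proc` is assumed to behave like a `multiprocessing.Process`.

```python
def stop_worker(proc, grace: float = 3.0) -> bool:
    """Stop a worker process; return True once it is no longer running.

    `proc` is expected to behave like multiprocessing.Process.
    """
    if proc is None or not proc.is_alive():
        return True
    proc.terminate()          # polite request (SIGTERM on POSIX)
    proc.join(timeout=grace)  # give it a moment to exit
    if proc.is_alive():
        proc.kill()           # forceful stop (SIGKILL on POSIX)
        proc.join(timeout=1.0)
    return not proc.is_alive()
```

Unlike a thread with a stop flag, a process can be ended even while it is blocked inside a long model download or inference call, which is what makes cancellation immediate.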
||||||
|
|
||||||
## Jupyter Notebook
|
## Jupyter Notebook
|
||||||
Don't want fancy EXEs or GUIs? Use the function as is. See [example](example.ipynb) for an implementation on Jupyter Notebook.
|
Don't want fancy EXEs or GUIs? Use the function as is. See [example](example.ipynb) for an implementation on Jupyter Notebook.
|
||||||
|
|||||||
@@ -4,7 +4,8 @@ import tkinter as tk
|
|||||||
from tkinter import ttk
|
from tkinter import ttk
|
||||||
from tkinter import filedialog
|
from tkinter import filedialog
|
||||||
from tkinter import messagebox
|
from tkinter import messagebox
|
||||||
from src._LocalTranscribe import transcribe, get_path
|
from src._LocalTranscribe import transcribe, get_path, detect_backend, _transcribe_worker_process
|
||||||
|
import multiprocessing as mp
|
||||||
import customtkinter
|
import customtkinter
|
||||||
import threading
|
import threading
|
||||||
|
|
||||||
@@ -46,11 +47,121 @@ HF_MODEL_MAP = {
|
|||||||
'KB Swedish (large)': 'KBLab/kb-whisper-large',
|
'KB Swedish (large)': 'KBLab/kb-whisper-large',
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Per-model info shown in the UI description label
|
||||||
|
# (speed, size, quality stars, suggested use)
|
||||||
|
MODEL_INFO = {
|
||||||
|
'tiny': ('Very fast', '~75 MB', '★★☆☆☆', 'Quick drafts & testing'),
|
||||||
|
'tiny.en': ('Very fast', '~75 MB', '★★☆☆☆', 'Quick drafts & testing (English only)'),
|
||||||
|
'base': ('Fast', '~145 MB', '★★★☆☆', 'Notes & short podcasts'),
|
||||||
|
'base.en': ('Fast', '~145 MB', '★★★☆☆', 'Notes & short podcasts (English only)'),
|
||||||
|
'small': ('Balanced', '~485 MB', '★★★★☆', 'Everyday use'),
|
||||||
|
'small.en': ('Balanced', '~485 MB', '★★★★☆', 'Everyday use (English only)'),
|
||||||
|
'medium': ('Accurate', '~1.5 GB', '★★★★☆', 'Professional content'),
|
||||||
|
'medium.en': ('Accurate', '~1.5 GB', '★★★★☆', 'Professional content (English only)'),
|
||||||
|
'large-v2': ('Slow', '~3 GB', '★★★★★', 'Maximum accuracy'),
|
||||||
|
'large-v3': ('Slow', '~3 GB', '★★★★★', 'Maximum accuracy (recommended)'),
|
||||||
|
'KB Swedish (tiny)': ('Very fast', '~75 MB', '★★★☆☆', 'Swedish — optimised by KBLab'),
|
||||||
|
'KB Swedish (base)': ('Fast', '~145 MB', '★★★☆☆', 'Swedish — optimised by KBLab'),
|
||||||
|
'KB Swedish (small)': ('Balanced', '~485 MB', '★★★★☆', 'Swedish — optimised by KBLab'),
|
||||||
|
'KB Swedish (medium)': ('Accurate', '~1.5 GB', '★★★★☆', 'Swedish — optimised by KBLab'),
|
||||||
|
'KB Swedish (large)': ('Slow', '~3 GB', '★★★★★', 'Swedish — KBLab, best accuracy'),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
customtkinter.set_appearance_mode("System")
|
customtkinter.set_appearance_mode("System")
|
||||||
customtkinter.set_default_color_theme("blue") # Themes: blue (default), dark-blue, green
|
customtkinter.set_default_color_theme("blue") # Themes: blue (default), dark-blue, green
|
||||||
firstclick = True
|
|
||||||
|
# All languages supported by Whisper (display label → ISO code; None = auto-detect)
|
||||||
|
WHISPER_LANGUAGES = {
|
||||||
|
'Auto-detect': None,
|
||||||
|
'Afrikaans (af)': 'af', 'Albanian (sq)': 'sq',
|
||||||
|
'Amharic (am)': 'am', 'Arabic (ar)': 'ar',
|
||||||
|
'Armenian (hy)': 'hy', 'Assamese (as)': 'as',
|
||||||
|
'Azerbaijani (az)': 'az', 'Bashkir (ba)': 'ba',
|
||||||
|
'Basque (eu)': 'eu', 'Belarusian (be)': 'be',
|
||||||
|
'Bengali (bn)': 'bn', 'Bosnian (bs)': 'bs',
|
||||||
|
'Breton (br)': 'br', 'Bulgarian (bg)': 'bg',
|
||||||
|
'Catalan (ca)': 'ca', 'Chinese (zh)': 'zh',
|
||||||
|
'Croatian (hr)': 'hr', 'Czech (cs)': 'cs',
|
||||||
|
'Danish (da)': 'da', 'Dutch (nl)': 'nl',
|
||||||
|
'English (en)': 'en', 'Estonian (et)': 'et',
|
||||||
|
'Faroese (fo)': 'fo', 'Finnish (fi)': 'fi',
|
||||||
|
'French (fr)': 'fr', 'Galician (gl)': 'gl',
|
||||||
|
'Georgian (ka)': 'ka', 'German (de)': 'de',
|
||||||
|
'Greek (el)': 'el', 'Gujarati (gu)': 'gu',
|
||||||
|
'Haitian Creole (ht)': 'ht', 'Hausa (ha)': 'ha',
|
||||||
|
'Hawaiian (haw)': 'haw', 'Hebrew (he)': 'he',
|
||||||
|
'Hindi (hi)': 'hi', 'Hungarian (hu)': 'hu',
|
||||||
|
'Icelandic (is)': 'is', 'Indonesian (id)': 'id',
|
||||||
|
'Italian (it)': 'it', 'Japanese (ja)': 'ja',
|
||||||
|
'Javanese (jw)': 'jw', 'Kannada (kn)': 'kn',
|
||||||
|
'Kazakh (kk)': 'kk', 'Khmer (km)': 'km',
|
||||||
|
'Korean (ko)': 'ko', 'Lao (lo)': 'lo',
|
||||||
|
'Latin (la)': 'la', 'Latvian (lv)': 'lv',
|
||||||
|
'Lingala (ln)': 'ln', 'Lithuanian (lt)': 'lt',
|
||||||
|
'Luxembourgish (lb)': 'lb', 'Macedonian (mk)': 'mk',
|
||||||
|
'Malagasy (mg)': 'mg', 'Malay (ms)': 'ms',
|
||||||
|
'Malayalam (ml)': 'ml', 'Maltese (mt)': 'mt',
|
||||||
|
'Maori (mi)': 'mi', 'Marathi (mr)': 'mr',
|
||||||
|
'Mongolian (mn)': 'mn', 'Myanmar (my)': 'my',
|
||||||
|
'Nepali (ne)': 'ne', 'Norwegian (no)': 'no',
|
||||||
|
'Occitan (oc)': 'oc', 'Pashto (ps)': 'ps',
|
||||||
|
'Persian (fa)': 'fa', 'Polish (pl)': 'pl',
|
||||||
|
'Portuguese (pt)': 'pt', 'Punjabi (pa)': 'pa',
|
||||||
|
'Romanian (ro)': 'ro', 'Russian (ru)': 'ru',
|
||||||
|
'Sanskrit (sa)': 'sa', 'Serbian (sr)': 'sr',
|
||||||
|
'Shona (sn)': 'sn', 'Sindhi (sd)': 'sd',
|
||||||
|
'Sinhala (si)': 'si', 'Slovak (sk)': 'sk',
|
||||||
|
'Slovenian (sl)': 'sl', 'Somali (so)': 'so',
|
||||||
|
'Spanish (es)': 'es', 'Sundanese (su)': 'su',
|
||||||
|
'Swahili (sw)': 'sw', 'Swedish (sv)': 'sv',
|
||||||
|
'Tagalog (tl)': 'tl', 'Tajik (tg)': 'tg',
|
||||||
|
'Tamil (ta)': 'ta', 'Tatar (tt)': 'tt',
|
||||||
|
'Telugu (te)': 'te', 'Thai (th)': 'th',
|
||||||
|
'Tibetan (bo)': 'bo', 'Turkish (tr)': 'tr',
|
||||||
|
'Turkmen (tk)': 'tk', 'Ukrainian (uk)': 'uk',
|
||||||
|
'Urdu (ur)': 'ur', 'Uzbek (uz)': 'uz',
|
||||||
|
'Vietnamese (vi)': 'vi', 'Welsh (cy)': 'cy',
|
||||||
|
'Yiddish (yi)': 'yi', 'Yoruba (yo)': 'yo',
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _language_options_for_model(model_name):
|
||||||
|
"""Return (values, default, state) for the language combobox given a model name."""
|
||||||
|
if model_name.endswith('.en'):
|
||||||
|
return ['English (en)'], 'English (en)', 'disabled'
|
||||||
|
if model_name.startswith('KB Swedish'):
|
||||||
|
return ['Swedish (sv)'], 'Swedish (sv)', 'disabled'
|
||||||
|
return list(WHISPER_LANGUAGES.keys()), 'Auto-detect', 'readonly'
|
||||||
|
|
||||||
|
|
||||||
|
def _set_app_icon(root):
|
||||||
|
"""Set app icon when supported, without crashing on unsupported platforms."""
|
||||||
|
base_dir = os.path.dirname(os.path.abspath(__file__))
|
||||||
|
icon_path = os.path.join(base_dir, "images", "icon.ico")
|
||||||
|
|
||||||
|
if not os.path.exists(icon_path):
|
||||||
|
return
|
||||||
|
|
||||||
|
try:
|
||||||
|
root.iconbitmap(icon_path)
|
||||||
|
except tk.TclError:
|
||||||
|
# Some Linux Tk builds don't accept .ico for iconbitmap.
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def _apply_display_scaling(root):
|
||||||
|
"""Auto-scale UI for high-resolution displays (e.g., 4K)."""
|
||||||
|
try:
|
||||||
|
screen_w = root.winfo_screenwidth()
|
||||||
|
screen_h = root.winfo_screenheight()
|
||||||
|
scale = min(screen_w / 1920.0, screen_h / 1080.0)
|
||||||
|
scale = max(1.0, min(scale, 2.0))
|
||||||
|
customtkinter.set_widget_scaling(scale)
|
||||||
|
customtkinter.set_window_scaling(scale)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
class App:
|
class App:
|
||||||
def __init__(self, master):
|
def __init__(self, master):
|
||||||
@@ -66,22 +177,16 @@ class App:
|
|||||||
self.path_entry.insert(0, os.path.join(os.getcwd(), 'sample_audio'))
|
self.path_entry.insert(0, os.path.join(os.getcwd(), 'sample_audio'))
|
||||||
self.path_entry.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
self.path_entry.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
||||||
customtkinter.CTkButton(path_frame, text="Browse", command=self.browse, font=font).pack(side=tk.LEFT, padx=5)
|
customtkinter.CTkButton(path_frame, text="Browse", command=self.browse, font=font).pack(side=tk.LEFT, padx=5)
|
||||||
# Language frame
|
# Language frame
|
||||||
#thanks to pommicket from Stackoverflow for this fix
|
|
||||||
def on_entry_click(event):
|
|
||||||
"""function that gets called whenever entry is clicked"""
|
|
||||||
global firstclick
|
|
||||||
if firstclick: # if this is the first time they clicked it
|
|
||||||
firstclick = False
|
|
||||||
self.language_entry.delete(0, "end") # delete all the text in the entry
|
|
||||||
language_frame = customtkinter.CTkFrame(master)
|
language_frame = customtkinter.CTkFrame(master)
|
||||||
language_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
language_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
||||||
customtkinter.CTkLabel(language_frame, text="Language:", font=font).pack(side=tk.LEFT, padx=5)
|
customtkinter.CTkLabel(language_frame, text="Language:", font=font).pack(side=tk.LEFT, padx=5)
|
||||||
self.language_entry = customtkinter.CTkEntry(language_frame, width=50, font=('Roboto', 12, 'italic'))
|
_lang_values, _lang_default, _lang_state = _language_options_for_model('medium')
|
||||||
self.default_language_text = "Enter language (or ignore to auto-detect)"
|
self.language_combobox = customtkinter.CTkComboBox(
|
||||||
self.language_entry.insert(0, self.default_language_text)
|
language_frame, width=50, state=_lang_state,
|
||||||
self.language_entry.bind('<FocusIn>', on_entry_click)
|
values=_lang_values, font=font_b)
|
||||||
self.language_entry.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
self.language_combobox.set(_lang_default)
|
||||||
|
self.language_combobox.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
||||||
# Model frame
|
# Model frame
|
||||||
models = ['tiny', 'tiny.en', 'base', 'base.en',
|
models = ['tiny', 'tiny.en', 'base', 'base.en',
|
||||||
'small', 'small.en', 'medium', 'medium.en',
|
'small', 'small.en', 'medium', 'medium.en',
|
||||||
@@ -96,16 +201,54 @@ class App:
|
|||||||
# ComboBox frame
|
# ComboBox frame
|
||||||
self.model_combobox = customtkinter.CTkComboBox(
|
self.model_combobox = customtkinter.CTkComboBox(
|
||||||
model_frame, width=50, state="readonly",
|
model_frame, width=50, state="readonly",
|
||||||
values=models, font=font_b)
|
values=models, font=font_b,
|
||||||
|
command=self._on_model_change)
|
||||||
self.model_combobox.set('medium') # Set the default value
|
self.model_combobox.set('medium') # Set the default value
|
||||||
self.model_combobox.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
self.model_combobox.pack(side=tk.LEFT, fill=tk.X, expand=True)
|
||||||
|
# Model description label
|
||||||
|
self.model_desc_label = customtkinter.CTkLabel(
|
||||||
|
master, text=self._model_desc_text('medium'),
|
||||||
|
font=('Roboto', 11), text_color=('#555555', '#aaaaaa'),
|
||||||
|
anchor='w')
|
||||||
|
self.model_desc_label.pack(fill=tk.X, padx=14, pady=(0, 4))
|
||||||
|
# Timestamps toggle
|
||||||
|
ts_frame = customtkinter.CTkFrame(master)
|
||||||
|
ts_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
||||||
|
self.timestamps_var = tk.BooleanVar(value=True)
|
||||||
|
self.timestamps_switch = customtkinter.CTkSwitch(
|
||||||
|
ts_frame, text="Include timestamps in transcription",
|
||||||
|
variable=self.timestamps_var, font=font_b)
|
||||||
|
self.timestamps_switch.pack(side=tk.LEFT, padx=5)
|
||||||
|
# Advanced options frame
|
||||||
|
adv_frame = customtkinter.CTkFrame(master)
|
||||||
|
adv_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
||||||
|
self.vad_var = tk.BooleanVar(value=False)
|
||||||
|
customtkinter.CTkSwitch(
|
||||||
|
adv_frame, text="VAD filter (remove silence)",
|
||||||
|
variable=self.vad_var, font=font_b).pack(side=tk.LEFT, padx=5)
|
||||||
|
self.word_ts_var = tk.BooleanVar(value=False)
|
||||||
|
customtkinter.CTkSwitch(
|
||||||
|
adv_frame, text="Word-level timestamps",
|
||||||
|
variable=self.word_ts_var, font=font_b).pack(side=tk.LEFT, padx=5)
|
||||||
|
self.translate_var = tk.BooleanVar(value=False)
|
||||||
|
customtkinter.CTkSwitch(
|
||||||
|
adv_frame, text="Translate to English",
|
||||||
|
variable=self.translate_var, font=font_b).pack(side=tk.LEFT, padx=5)
|
||||||
# Progress Bar
|
# Progress Bar
|
||||||
self.progress_bar = ttk.Progressbar(master, length=200, mode='indeterminate')
|
self.progress_bar = ttk.Progressbar(master, length=200, mode='indeterminate')
|
||||||
|
# Worker process handle (replaces thread+stop_event for true immediate cancellation)
|
||||||
|
self._proc = None
|
||||||
|
self._parent_conn = None
|
||||||
|
self._child_conn = None
|
||||||
# Button actions frame
|
# Button actions frame
|
||||||
button_frame = customtkinter.CTkFrame(master)
|
button_frame = customtkinter.CTkFrame(master)
|
||||||
button_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
button_frame.pack(fill=tk.BOTH, padx=10, pady=10)
|
||||||
self.transcribe_button = customtkinter.CTkButton(button_frame, text="Transcribe", command=self.start_transcription, font=font)
|
self.transcribe_button = customtkinter.CTkButton(button_frame, text="Transcribe", command=self.start_transcription, font=font)
|
||||||
self.transcribe_button.pack(side=tk.LEFT, padx=5, pady=10, fill=tk.X, expand=True)
|
self.transcribe_button.pack(side=tk.LEFT, padx=5, pady=10, fill=tk.X, expand=True)
|
||||||
|
self.stop_button = customtkinter.CTkButton(
|
||||||
|
button_frame, text="Stop", command=self._stop_transcription, font=font,
|
||||||
|
fg_color="#c0392b", hover_color="#922b21", state=tk.DISABLED)
|
||||||
|
self.stop_button.pack(side=tk.LEFT, padx=5, pady=10, fill=tk.X, expand=True)
|
||||||
customtkinter.CTkButton(button_frame, text="Quit", command=master.quit, font=font).pack(side=tk.RIGHT, padx=5, pady=10, fill=tk.X, expand=True)
|
customtkinter.CTkButton(button_frame, text="Quit", command=master.quit, font=font).pack(side=tk.RIGHT, padx=5, pady=10, fill=tk.X, expand=True)
|
||||||
|
|
||||||
# ── Embedded console / log panel ──────────────────────────────────
|
# ── Embedded console / log panel ──────────────────────────────────
|
||||||
@@ -120,11 +263,59 @@ class App:
|
|||||||
sys.stdout = _ConsoleRedirector(self.log_box)
|
sys.stdout = _ConsoleRedirector(self.log_box)
|
||||||
sys.stderr = _ConsoleRedirector(self.log_box)
|
sys.stderr = _ConsoleRedirector(self.log_box)
|
||||||
|
|
||||||
|
# Backend indicator
|
||||||
|
_bi = detect_backend()
|
||||||
|
backend_label = customtkinter.CTkLabel(
|
||||||
|
master,
|
||||||
|
text=f"Backend: {_bi['label']}",
|
||||||
|
font=('Roboto', 11),
|
||||||
|
text_color=("#555555", "#aaaaaa"),
|
||||||
|
anchor='e',
|
||||||
|
)
|
||||||
|
backend_label.pack(fill=tk.X, padx=12, pady=(0, 2))
|
||||||
|
|
||||||
# Welcome message (shown after redirect so it appears in the panel)
|
# Welcome message (shown after redirect so it appears in the panel)
|
||||||
print("Welcome to Local Transcribe with Whisper! \U0001f600")
|
print("Welcome to Local Transcribe with Whisper! \U0001f600")
|
||||||
print("Transcriptions will be saved automatically.")
|
print("Transcriptions will be saved automatically.")
|
||||||
print("─" * 46)
|
print("─" * 46)
|
||||||
# Helper functions
|
# Helper functions
|
||||||
|
def _stop_transcription(self):
|
||||||
|
self.stop_button.configure(state=tk.DISABLED)
|
||||||
|
if self._proc and self._proc.is_alive():
|
||||||
|
self._proc.terminate()
|
||||||
|
try:
|
||||||
|
self._proc.join(timeout=3)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
if self._proc.is_alive():
|
||||||
|
self._proc.kill()
|
||||||
|
try:
|
||||||
|
self._proc.join(timeout=1)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
# Close pipe ends — no semaphores, so no leak
|
||||||
|
for conn in (self._parent_conn, self._child_conn):
|
||||||
|
try:
|
||||||
|
if conn:
|
||||||
|
conn.close()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
self._parent_conn = self._child_conn = None
|
||||||
|
print("⛔ Transcription stopped by user.")
|
||||||
|
|
||||||
|
def _model_desc_text(self, model_name):
|
||||||
|
info = MODEL_INFO.get(model_name)
|
||||||
|
if not info:
|
||||||
|
return ''
|
||||||
|
speed, size, stars, use = info
|
||||||
|
return f'{stars} {speed} · {size} · {use}'
|
||||||
|
|
||||||
|
def _on_model_change(self, selected):
|
||||||
|
self.model_desc_label.configure(text=self._model_desc_text(selected))
|
||||||
|
values, default, state = _language_options_for_model(selected)
|
||||||
|
self.language_combobox.configure(values=values, state=state)
|
||||||
|
self.language_combobox.set(default)
|
||||||
|
|
||||||
# Browsing
|
# Browsing
|
||||||
def browse(self):
|
def browse(self):
|
||||||
initial_dir = os.getcwd()
|
initial_dir = os.getcwd()
|
||||||
@@ -133,64 +324,83 @@ class App:
|
|||||||
self.path_entry.insert(0, folder_path)
|
self.path_entry.insert(0, folder_path)
|
||||||
# Start transcription
|
# Start transcription
|
||||||
def start_transcription(self):
|
def start_transcription(self):
|
||||||
# Disable transcribe button
|
|
||||||
self.transcribe_button.configure(state=tk.DISABLED)
|
|
||||||
# Start a new thread for the transcription process
|
|
||||||
threading.Thread(target=self.transcribe_thread).start()
|
|
||||||
# Threading
|
|
||||||
def transcribe_thread(self):
|
|
||||||
path = self.path_entry.get()
|
|
||||||
model_display = self.model_combobox.get()
|
model_display = self.model_combobox.get()
|
||||||
# Ignore the visual separator
|
|
||||||
if model_display.startswith('─'):
|
if model_display.startswith('─'):
|
||||||
messagebox.showinfo("Invalid selection", "Please select a model, not the separator line.")
|
messagebox.showinfo("Invalid selection", "Please select a model, not the separator line.")
|
||||||
self.transcribe_button.configure(state=tk.NORMAL)
|
|
||||||
return
|
return
|
||||||
|
self.transcribe_button.configure(state=tk.DISABLED)
|
||||||
|
self.stop_button.configure(state=tk.NORMAL)
|
||||||
|
path = self.path_entry.get()
|
||||||
model = HF_MODEL_MAP.get(model_display, model_display)
|
model = HF_MODEL_MAP.get(model_display, model_display)
|
||||||
language = self.language_entry.get()
|
lang_label = self.language_combobox.get()
|
||||||
# Auto-set Swedish for KB models
|
language = WHISPER_LANGUAGES.get(lang_label, lang_label) if lang_label else None
|
||||||
is_kb_model = model_display.startswith('KB Swedish')
|
timestamps = self.timestamps_var.get()
|
||||||
# Check if the language field has the default text or is empty
|
vad_filter = self.vad_var.get()
|
||||||
if is_kb_model:
|
word_timestamps = self.word_ts_var.get()
|
||||||
language = 'sv'
|
translate = self.translate_var.get()
|
||||||
elif language == self.default_language_text or not language.strip():
|
glob_file = get_path(path)
|
||||||
language = None # This is the same as passing nothing
|
|
||||||
verbose = True # always show transcription progress in the console panel
|
|
||||||
# Show progress bar
|
|
||||||
self.progress_bar.pack(fill=tk.X, padx=5, pady=5)
|
self.progress_bar.pack(fill=tk.X, padx=5, pady=5)
|
||||||
self.progress_bar.start()
|
self.progress_bar.start()
|
||||||
# Setting path and files
|
self._parent_conn, self._child_conn = mp.Pipe(duplex=False)
|
||||||
glob_file = get_path(path)
|
self._proc = mp.Process(
|
||||||
#messagebox.showinfo("Message", "Starting transcription!")
|
target=_transcribe_worker_process,
|
||||||
# Start transcription
|
args=(self._child_conn, path, glob_file, model, language, True, timestamps),
|
||||||
|
kwargs={"vad_filter": vad_filter, "word_timestamps": word_timestamps, "translate": translate},
|
||||||
|
daemon=True,
|
||||||
|
)
|
||||||
|
self._proc.start()
|
||||||
|
self._child_conn.close() # parent doesn't write; close its write-end
|
||||||
|
self._child_conn = None
|
||||||
|
self.master.after(100, self._poll_worker)
|
||||||
|
|
||||||
|
def _poll_worker(self):
|
||||||
|
done = False
|
||||||
|
result = None
|
||||||
try:
|
try:
|
||||||
output_text = transcribe(path, glob_file, model, language, verbose)
|
while self._parent_conn and self._parent_conn.poll():
|
||||||
except UnboundLocalError:
|
msg = self._parent_conn.recv()
|
||||||
messagebox.showinfo("Files not found error!", 'Nothing found, choose another folder.')
|
if isinstance(msg, tuple) and msg[0] == '__done__':
|
||||||
|
done = True
|
||||||
|
result = msg[1]
|
||||||
|
else:
|
||||||
|
sys.stdout.write(msg)
|
||||||
|
sys.stdout.flush()
|
||||||
|
except EOFError:
|
||||||
|
# Child closed the pipe (normal completion or kill)
|
||||||
|
done = True
|
||||||
|
except Exception:
|
||||||
pass
|
pass
|
||||||
except ValueError:
|
if done or (self._proc and not self._proc.is_alive()):
|
||||||
messagebox.showinfo("Invalid language name, you might have to clear the default text to continue!")
|
if self._parent_conn:
|
||||||
# Hide progress bar
|
try:
|
||||||
|
self._parent_conn.close()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
self._parent_conn = None
|
||||||
|
self._on_transcription_done(result)
|
||||||
|
else:
|
||||||
|
self.master.after(100, self._poll_worker)
|
||||||
|
|
||||||
|
def _on_transcription_done(self, output_text):
|
||||||
self.progress_bar.stop()
|
self.progress_bar.stop()
|
||||||
self.progress_bar.pack_forget()
|
self.progress_bar.pack_forget()
|
||||||
# Enable transcribe button
|
self.stop_button.configure(state=tk.DISABLED)
|
||||||
self.transcribe_button.configure(state=tk.NORMAL)
|
self.transcribe_button.configure(state=tk.NORMAL)
|
||||||
# Recover output text
|
if output_text:
|
||||||
try:
|
title = "Finished!" if not output_text.startswith('⚠') else "Error"
|
||||||
messagebox.showinfo("Finished!", output_text)
|
messagebox.showinfo(title, output_text)
|
||||||
except UnboundLocalError:
|
|
||||||
pass
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
# Setting custom themes
|
# Setting custom themes
|
||||||
root = customtkinter.CTk()
|
root = customtkinter.CTk()
|
||||||
|
_apply_display_scaling(root)
|
||||||
root.title("Local Transcribe with Whisper")
|
root.title("Local Transcribe with Whisper")
|
||||||
# Geometry — taller to accommodate the embedded console panel
|
# Geometry — taller to accommodate the embedded console panel
|
||||||
width, height = 550, 560
|
width, height = 550, 560
|
||||||
root.geometry('{}x{}'.format(width, height))
|
root.geometry('{}x{}'.format(width, height))
|
||||||
root.minsize(450, 480)
|
root.minsize(450, 480)
|
||||||
# Icon
|
# Icon (best-effort; ignored on platforms/builds without .ico support)
|
||||||
root.iconbitmap('images/icon.ico')
|
_set_app_icon(root)
|
||||||
# Run
|
# Run
|
||||||
app = App(root)
|
app = App(root)
|
||||||
root.mainloop()
|
root.mainloop()
|
||||||
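The changes above replace the transcription thread with a child process that reports back over a one-way pipe, which the GUI polls on a timer. The following is a minimal sketch of that pattern outside of Tkinter, not the application's actual code. The names `_worker`, `drain`, and `run_demo` are hypothetical; only the `mp.Pipe(duplex=False)` call and the `'__done__'` sentinel tuple are taken from the diff.

```python
import multiprocessing as mp


def _worker(conn, items):
    """Hypothetical child entry point: stream text, then a final sentinel."""
    try:
        for item in items:
            conn.send(f"processed {item}\n")  # progress text for the console
        conn.send(("__done__", f"{len(items)} item(s) processed"))
    finally:
        conn.close()


def drain(conn, timeout=5.0):
    """Read messages until the sentinel or EOF; return (log_lines, result)."""
    lines, result = [], None
    try:
        while conn.poll(timeout):
            msg = conn.recv()
            if isinstance(msg, tuple) and msg[0] == "__done__":
                result = msg[1]
                break
            lines.append(msg)
    except (EOFError, OSError):
        pass  # writer closed its end without sending the sentinel
    return lines, result


def run_demo(items):
    # With duplex=False the first connection only receives, the second only sends.
    parent_conn, child_conn = mp.Pipe(duplex=False)
    proc = mp.Process(target=_worker, args=(child_conn, items), daemon=True)
    proc.start()
    child_conn.close()  # the parent keeps only the read end
    lines, result = drain(parent_conn)
    proc.join(timeout=5)
    parent_conn.close()
    return lines, result


if __name__ == "__main__":
    print(run_demo(["a.wav", "b.mp3"]))
```

In the GUI, the blocking `drain` loop would instead be a non-blocking `poll()` check rescheduled with `after(100, ...)`, as `_poll_worker` does. Closing the parent's copy of the write end matters: without it the reader never sees EOF if the child is killed.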
|
|||||||
@@ -1,2 +1,3 @@
 faster-whisper
+mlx-whisper
 customtkinter
|
|||||||
@@ -1,2 +1,4 @@
 Armstrong_Small_Step
-[0:00:00 --> 0:00:07]: That's one small step for man, one giant leap for mankind.
+────────────────────────────────────────
+
+That's one small step for man, one giant leap for mankind.
@@ -1,2 +1,4 @@
 Axel_Pettersson_röstinspelning
-[0:00:00 --> 0:00:15]: Hej, jag heter Axel Pettersson, jag föddes i Örebro 1976. Jag har varit Wikipedia sen 2008 och jag har översatt röstintroduktionsprojektet till svenska.
+────────────────────────────────────────
+
+Hej, jag heter Axel Pettersson, jag föddes i Örebro 1976. Jag har varit Wikipedia sen 2008 och jag har översatt röstintroduktionsprojektet till svenska.
||||||
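The sample transcripts above change from a single timestamped line to a title, a 40-character rule, a blank line, and the plain text. The sketch below reproduces that layout using the same `write` sequence the new `transcribe` code uses. The function name `format_transcript` and the `(start, end, text)` tuple shape are assumptions made for illustration.

```python
import datetime


def format_transcript(title, segments, timestamps=False):
    """Build the .txt body: title, a 40-character rule, then one line per segment.

    `segments` is an iterable of (start_seconds, end_seconds, text) tuples;
    that tuple shape is an assumption made for this sketch.
    """
    parts = [title, "\n" + "─" * 40 + "\n"]
    for start, end, text in segments:
        text = text.strip()
        if timestamps:
            start_ts = str(datetime.timedelta(seconds=start))
            end_ts = str(datetime.timedelta(seconds=end))
            parts.append("\n[{} --> {}] {}".format(start_ts, end_ts, text))
        else:
            parts.append("\n{}".format(text))
    return "".join(parts)


if __name__ == "__main__":
    demo = [(0, 7, " That's one small step for man, one giant leap for mankind.")]
    print(format_transcript("Armstrong_Small_Step", demo))
```

With `timestamps=False` this yields the four-line layout shown in the updated sample files; with `timestamps=True` each segment line is prefixed with `[start --> end]`.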
+384
-46
@@ -1,65 +1,210 @@
|
|||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
|
import platform
|
||||||
import datetime
|
import datetime
|
||||||
|
import time
|
||||||
import site
|
import site
|
||||||
from glob import glob
|
from glob import glob
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# CUDA setup — must happen before importing faster_whisper / ctranslate2
|
# CUDA setup — must happen before importing faster_whisper / ctranslate2
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
def _setup_cuda_dlls():
|
def _setup_cuda_libs():
|
||||||
"""Add NVIDIA pip-package DLL dirs to the DLL search path (Windows only).
|
"""Register NVIDIA pip-package lib dirs so ctranslate2 finds CUDA at runtime.
|
||||||
|
|
||||||
pip-installed nvidia-cublas-cu12 / nvidia-cudnn-cu12 place their .dll
|
pip-installed nvidia-cublas-cu12 / nvidia-cudnn-cu12 place their shared
|
||||||
files inside the site-packages tree. Python 3.8+ on Windows does NOT
|
libraries inside the site-packages tree. Neither Windows nor Linux
|
||||||
search PATH for DLLs loaded via ctypes/LoadLibrary, so we must
|
automatically search those directories, so we must register them
|
||||||
explicitly register every nvidia/*/bin and nvidia/*/lib directory using
|
explicitly:
|
||||||
os.add_dll_directory *and* prepend them to PATH (some native extensions
|
- Windows: os.add_dll_directory() + PATH
|
||||||
still rely on PATH).
|
- Linux: LD_LIBRARY_PATH (read by the dynamic linker)
|
||||||
"""
|
"""
|
||||||
if sys.platform != "win32":
|
|
||||||
return
|
|
||||||
try:
|
try:
|
||||||
for sp in site.getsitepackages():
|
sp_dirs = site.getsitepackages()
|
||||||
nvidia_root = os.path.join(sp, "nvidia")
|
except AttributeError:
|
||||||
if not os.path.isdir(nvidia_root):
|
# virtualenv without site-packages helper
|
||||||
continue
|
sp_dirs = [os.path.join(sys.prefix, "lib",
|
||||||
for pkg in os.listdir(nvidia_root):
|
"python" + ".".join(map(str, sys.version_info[:2])),
|
||||||
for sub in ("bin", "lib"):
|
"site-packages")]
|
||||||
d = os.path.join(nvidia_root, pkg, sub)
|
|
||||||
if os.path.isdir(d):
|
|
||||||
os.environ["PATH"] = d + os.pathsep + os.environ.get("PATH", "")
|
|
||||||
try:
|
|
||||||
os.add_dll_directory(d)
|
|
||||||
except (OSError, AttributeError):
|
|
||||||
pass
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
_setup_cuda_dlls()
|
for sp in sp_dirs:
|
||||||
|
nvidia_root = os.path.join(sp, "nvidia")
|
||||||
|
if not os.path.isdir(nvidia_root):
|
||||||
|
continue
|
||||||
|
for pkg in os.listdir(nvidia_root):
|
||||||
|
for sub in ("bin", "lib"):
|
||||||
|
d = os.path.join(nvidia_root, pkg, sub)
|
||||||
|
if not os.path.isdir(d):
|
||||||
|
continue
|
||||||
|
if sys.platform == "win32":
|
||||||
|
os.environ["PATH"] = d + os.pathsep + os.environ.get("PATH", "")
|
||||||
|
try:
|
||||||
|
os.add_dll_directory(d)
|
||||||
|
except (OSError, AttributeError):
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
# Linux / macOS — prepend to LD_LIBRARY_PATH
|
||||||
|
ld = os.environ.get("LD_LIBRARY_PATH", "")
|
||||||
|
if d not in ld:
|
||||||
|
os.environ["LD_LIBRARY_PATH"] = d + (":" + ld if ld else "")
|
||||||
|
# Also load via ctypes so the already-running process can see the libraries
|
||||||
|
import ctypes
|
||||||
|
try:
|
||||||
|
for so in sorted(os.listdir(d)):
|
||||||
|
if so.endswith(".so") or ".so." in so:
|
||||||
|
ctypes.cdll.LoadLibrary(os.path.join(d, so))
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
_setup_cuda_libs()
|
||||||
|
|
||||||
from faster_whisper import WhisperModel
|
from faster_whisper import WhisperModel
|
||||||
|
|
||||||
|
|
||||||
def _detect_device():
|
SUPPORTED_EXTENSIONS = {
|
||||||
"""Return (device, compute_type) for the best available backend."""
|
".wav", ".mp3", ".m4a", ".flac", ".ogg", ".wma", ".aac",
|
||||||
|
".mp4", ".mkv", ".mov", ".webm", ".avi", ".mpeg", ".mpg",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# MLX model map (Apple Silicon only)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
_MLX_MODEL_MAP = {
|
||||||
|
"tiny": "mlx-community/whisper-tiny-mlx",
|
||||||
|
"base": "mlx-community/whisper-base-mlx",
|
||||||
|
"small": "mlx-community/whisper-small-mlx",
|
||||||
|
"medium": "mlx-community/whisper-medium-mlx",
|
||||||
|
"large-v2": "mlx-community/whisper-large-v2-mlx",
|
||||||
|
"large-v3": "mlx-community/whisper-large-v3-mlx",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def detect_backend():
|
||||||
|
"""Return the best available inference backend.
|
||||||
|
|
||||||
|
Returns a dict with keys:
|
||||||
|
backend : "mlx" | "cuda" | "cpu"
|
||||||
|
device : device string for WhisperModel (cuda / cpu)
|
||||||
|
compute_type : compute type string for WhisperModel
|
||||||
|
label : human-readable label for UI display
|
||||||
|
"""
|
||||||
|
# Apple Silicon → try MLX (GPU + Neural Engine via Apple MLX)
|
||||||
|
if sys.platform == "darwin" and platform.machine() == "arm64":
|
||||||
|
try:
|
||||||
|
import mlx_whisper # noqa: F401
|
||||||
|
return {
|
||||||
|
"backend": "mlx",
|
||||||
|
"device": "cpu",
|
||||||
|
"compute_type": "int8",
|
||||||
|
"label": "MLX · Apple GPU/NPU",
|
||||||
|
}
|
||||||
|
except ImportError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# NVIDIA CUDA
|
||||||
try:
|
try:
|
||||||
import ctranslate2
|
import ctranslate2
|
||||||
cuda_types = ctranslate2.get_supported_compute_types("cuda")
|
cuda_types = ctranslate2.get_supported_compute_types("cuda")
|
||||||
if "float16" in cuda_types:
|
if "float16" in cuda_types:
|
||||||
return "cuda", "float16"
|
return {
|
||||||
|
"backend": "cuda",
|
||||||
|
"device": "cuda",
|
||||||
|
"compute_type": "float16",
|
||||||
|
"label": "CUDA · GPU",
|
||||||
|
}
|
||||||
except Exception:
|
except Exception:
|
||||||
pass
|
pass
|
||||||
return "cpu", "int8"
|
|
||||||
|
return {
|
||||||
|
"backend": "cpu",
|
||||||
|
"device": "cpu",
|
||||||
|
"compute_type": "int8",
|
||||||
|
"label": "CPU · int8",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _decode_audio_pyav(file_path):
|
||||||
|
"""Decode any audio/video file to a float32 mono 16 kHz numpy array.
|
||||||
|
|
||||||
|
Uses PyAV (bundled FFmpeg) — no external ffmpeg binary required.
|
||||||
|
Returns (audio_array, duration_seconds).
|
||||||
|
"""
|
||||||
|
import av
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
with av.open(file_path) as container:
|
||||||
|
duration = float(container.duration) / 1_000_000 # microseconds → seconds
|
||||||
|
stream = container.streams.audio[0]
|
||||||
|
resampler = av.AudioResampler(format="fltp", layout="mono", rate=16000)
|
||||||
|
chunks = []
|
||||||
|
for frame in container.decode(stream):
|
||||||
|
for out in resampler.resample(frame):
|
||||||
|
if out:
|
||||||
|
chunks.append(out.to_ndarray()[0])
|
||||||
|
# Flush resampler
|
||||||
|
for out in resampler.resample(None):
|
||||||
|
if out:
|
||||||
|
chunks.append(out.to_ndarray()[0])
|
||||||
|
|
||||||
|
if not chunks:
|
||||||
|
return np.zeros(0, dtype=np.float32), duration
|
||||||
|
return np.concatenate(chunks, axis=0), duration
|
||||||
|
|
||||||
|
|
||||||
|
def _transcribe_mlx_file(file, mlx_model_id, language, timestamps, verbose, vad_filter=False, word_timestamps=False, translate=False):
|
||||||
|
"""Transcribe a single file with mlx-whisper (Apple GPU/NPU).
|
||||||
|
|
||||||
|
Decodes audio via PyAV (no system ffmpeg needed), then runs MLX inference.
|
||||||
|
Returns (segments_as_dicts, audio_duration_seconds).
|
||||||
|
Segments have dict keys: 'start', 'end', 'text'.
|
||||||
|
"""
|
||||||
|
import mlx_whisper
|
||||||
|
|
||||||
|
audio_array, duration = _decode_audio_pyav(file)
|
||||||
|
|
||||||
|
decode_opts = {}
|
||||||
|
if language:
|
||||||
|
decode_opts["language"] = language
|
||||||
|
if translate:
|
||||||
|
decode_opts["task"] = "translate"
|
||||||
|
if word_timestamps:
|
||||||
|
decode_opts["word_timestamps"] = True
|
||||||
|
|
||||||
|
result = mlx_whisper.transcribe(
|
||||||
|
audio_array,
|
||||||
|
path_or_hf_repo=mlx_model_id,
|
||||||
|
verbose=(True if verbose else None),
|
||||||
|
**decode_opts,
|
||||||
|
)
|
||||||
|
segments = result["segments"]
|
||||||
|
audio_duration = segments[-1]["end"] if segments else duration
|
||||||
|
return segments, audio_duration
|
||||||
|
|
||||||
|
|
||||||
|
def _srt_timestamp(seconds):
|
||||||
|
"""Convert seconds (float) to SRT timestamp format HH:MM:SS,mmm."""
|
||||||
|
ms = round(seconds * 1000)
|
||||||
|
h, ms = divmod(ms, 3_600_000)
|
||||||
|
m, ms = divmod(ms, 60_000)
|
||||||
|
s, ms = divmod(ms, 1000)
|
||||||
|
return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
|
||||||
|
|
||||||
|
|
||||||
# Get the path
|
# Get the path
|
||||||
def get_path(path):
|
def get_path(path):
|
||||||
glob_file = glob(path + '/*')
|
all_items = glob(path + '/*')
|
||||||
return glob_file
|
media_files = []
|
||||||
|
for item in all_items:
|
||||||
|
if not os.path.isfile(item):
|
||||||
|
continue
|
||||||
|
_, ext = os.path.splitext(item)
|
||||||
|
if ext.lower() in SUPPORTED_EXTENSIONS:
|
||||||
|
media_files.append(item)
|
||||||
|
return sorted(media_files)
|
||||||
|
|
||||||
# Main function
|
# Main function
|
||||||
def transcribe(path, glob_file, model=None, language=None, verbose=False):
|
def transcribe(path, glob_file, model=None, language=None, verbose=False, timestamps=True, stop_event=None, vad_filter=False, word_timestamps=False, translate=False):
|
||||||
"""
|
"""
|
||||||
Transcribes audio files in a specified folder using faster-whisper (CTranslate2).
|
Transcribes audio files in a specified folder using faster-whisper (CTranslate2).
|
||||||
|
|
||||||
@@ -90,53 +235,204 @@ def transcribe(path, glob_file, model=None, language=None, verbose=False):
|
|||||||
SEP = "─" * 46
|
SEP = "─" * 46
|
||||||
|
|
||||||
# ── Step 1: Detect hardware ──────────────────────────────────────
|
# ── Step 1: Detect hardware ──────────────────────────────────────
|
||||||
device, compute_type = _detect_device()
|
backend_info = detect_backend()
|
||||||
print(f"⚙ Device: {device} | Compute: {compute_type}")
|
backend = backend_info["backend"]
|
||||||
|
device = backend_info["device"]
|
||||||
|
compute_type = backend_info["compute_type"]
|
||||||
|
print(f"⚙ Backend: {backend_info['label']}")
|
||||||
|
|
||||||
# ── Step 2: Load model ───────────────────────────────────────────
|
# ── Step 1b: MLX path (Apple GPU/NPU) ───────────────────────────
|
||||||
|
if backend == "mlx":
|
||||||
|
mlx_model_id = _MLX_MODEL_MAP.get(model)
|
||||||
|
if mlx_model_id is None:
|
||||||
|
print(f"⚠ Model '{model}' is not available in MLX format.")
|
||||||
|
print(" Falling back to faster-whisper on CPU (int8).")
|
||||||
|
backend = "cpu"
|
||||||
|
device, compute_type = "cpu", "int8"
|
||||||
|
else:
|
||||||
|
# ── Step 2 (MLX): load + transcribe ─────────────────────
|
||||||
|
print(f"⏳ Loading MLX model '{model}' — downloading if needed...")
|
||||||
|
print("✅ Model ready!")
|
||||||
|
print(SEP)
|
||||||
|
|
||||||
|
total_files = len(glob_file)
|
||||||
|
print(f"📂 Found {total_files} supported media file(s) in folder")
|
||||||
|
print(SEP)
|
||||||
|
|
||||||
|
if total_files == 0:
|
||||||
|
output_text = '⚠ No supported media files found — try another folder.'
|
||||||
|
print(output_text)
|
||||||
|
print(SEP)
|
||||||
|
return output_text
|
||||||
|
|
||||||
|
files_transcripted = []
|
||||||
|
file_num = 0
|
||||||
|
for file in glob_file:
|
||||||
|
if stop_event and stop_event.is_set():
|
||||||
|
print("⛔ Transcription stopped by user.")
|
||||||
|
break
|
||||||
|
title = os.path.basename(file).split('.')[0]
|
||||||
|
file_num += 1
|
||||||
|
print(f"\n{'─' * 46}")
|
||||||
|
print(f"📄 File {file_num}/{total_files}: {title}")
|
||||||
|
try:
|
||||||
|
t_start = time.time()
|
||||||
|
segments, audio_duration = _transcribe_mlx_file(
|
||||||
|
file, mlx_model_id, language, timestamps, verbose,
|
||||||
|
vad_filter=vad_filter, word_timestamps=word_timestamps,
|
||||||
|
translate=translate
|
||||||
|
)
|
||||||
|
os.makedirs('{}/transcriptions'.format(path), exist_ok=True)
|
||||||
|
segment_list = []
|
||||||
|
txt_path = "{}/transcriptions/{}.txt".format(path, title)
|
||||||
|
srt_path = "{}/transcriptions/{}.srt".format(path, title)
|
||||||
|
with open(txt_path, 'w', encoding='utf-8') as f, \
|
||||||
|
open(srt_path, 'w', encoding='utf-8') as srt_f:
|
||||||
|
f.write(title)
|
||||||
|
f.write('\n' + '─' * 40 + '\n')
|
||||||
|
for idx, seg in enumerate(segments, start=1):
|
||||||
|
if stop_event and stop_event.is_set():
|
||||||
|
break
|
||||||
|
text = seg["text"].strip()
|
||||||
|
if timestamps:
|
||||||
|
start_ts = str(datetime.timedelta(seconds=seg["start"]))
|
||||||
|
end_ts = str(datetime.timedelta(seconds=seg["end"]))
|
||||||
|
f.write('\n[{} --> {}] {}'.format(start_ts, end_ts, text))
|
||||||
|
else:
|
||||||
|
f.write('\n{}'.format(text))
|
||||||
|
srt_f.write(f'{idx}\n{_srt_timestamp(seg["start"])} --> {_srt_timestamp(seg["end"])}\n{text}\n\n')
|
||||||
|
f.flush()
|
||||||
|
srt_f.flush()
|
||||||
|
if verbose:
|
||||||
|
print(" [%.2fs → %.2fs] %s" % (seg["start"], seg["end"], seg["text"]))
|
||||||
|
else:
|
||||||
|
print(" Transcribed up to %.0fs..." % seg["end"], end='\r')
|
||||||
|
segment_list.append(seg)
|
||||||
|
elapsed = time.time() - t_start
|
||||||
|
elapsed_min = elapsed / 60.0
|
||||||
|
audio_min = audio_duration / 60.0
|
||||||
|
ratio = audio_duration / elapsed if elapsed > 0 else float('inf')
|
||||||
|
print(f"✅ Done — saved to transcriptions/{title}.txt")
|
||||||
|
print(f"⏱ Transcribed {audio_min:.1f} min of audio in {elapsed_min:.1f} min ({ratio:.1f}x realtime)")
|
||||||
|
files_transcripted.append(segment_list)
|
||||||
|
except Exception as exc:
|
||||||
|
print(f"⚠ Could not decode '{os.path.basename(file)}', skipping.")
|
||||||
|
print(f" Reason: {exc}")
|
||||||
|
|
||||||
|
print(f"\n{SEP}")
|
||||||
|
if files_transcripted:
|
||||||
|
output_text = f"✅ Finished! {len(files_transcripted)} file(s) transcribed.\n Saved in: {path}/transcriptions"
|
||||||
|
else:
|
||||||
|
output_text = '⚠ No files eligible for transcription — try another folder.'
|
||||||
|
print(output_text)
|
||||||
|
print(SEP)
|
||||||
|
return output_text
|
||||||
|
|
||||||
|
# ── Step 2: Load model (faster-whisper / CTranslate2) ───────────
|
||||||
print(f"⏳ Loading model '{model}' — downloading if needed...")
|
print(f"⏳ Loading model '{model}' — downloading if needed...")
|
||||||
whisper_model = WhisperModel(model, device=device, compute_type=compute_type)
|
try:
|
||||||
|
whisper_model = WhisperModel(model, device=device, compute_type=compute_type)
|
||||||
|
except Exception as exc:
|
||||||
|
err = str(exc).lower()
|
||||||
|
cuda_runtime_missing = (
|
||||||
|
device == "cuda"
|
||||||
|
and (
|
||||||
|
"libcublas" in err
|
||||||
|
or "libcudnn" in err
|
||||||
|
or "cuda" in err
|
||||||
|
or "cannot be loaded" in err
|
||||||
|
or "not found" in err
|
||||||
|
)
|
||||||
|
)
|
||||||
|
if not cuda_runtime_missing:
|
||||||
|
raise
|
||||||
|
print("⚠ CUDA runtime not available; falling back to CPU (int8).")
|
||||||
|
print(f" Reason: {exc}")
|
||||||
|
device, compute_type = "cpu", "int8"
|
||||||
|
whisper_model = WhisperModel(model, device=device, compute_type=compute_type)
|
||||||
print("✅ Model ready!")
|
print("✅ Model ready!")
|
||||||
print(SEP)
|
print(SEP)
|
||||||
|
|
||||||
# ── Step 3: Transcribe files ─────────────────────────────────────
|
# ── Step 3: Transcribe files ─────────────────────────────────────
|
||||||
total_files = len(glob_file)
|
total_files = len(glob_file)
|
||||||
print(f"📂 Found {total_files} item(s) in folder")
|
print(f"📂 Found {total_files} supported media file(s) in folder")
|
||||||
print(SEP)
|
print(SEP)
|
||||||
|
|
||||||
|
if total_files == 0:
|
||||||
|
output_text = '⚠ No supported media files found — try another folder.'
|
||||||
|
print(output_text)
|
||||||
|
print(SEP)
|
||||||
|
return output_text
|
||||||
|
|
||||||
files_transcripted = []
|
files_transcripted = []
|
||||||
file_num = 0
|
file_num = 0
|
||||||
for file in glob_file:
|
for file in glob_file:
|
||||||
|
if stop_event and stop_event.is_set():
|
||||||
|
print("⛔ Transcription stopped by user.")
|
||||||
|
break
|
||||||
title = os.path.basename(file).split('.')[0]
|
title = os.path.basename(file).split('.')[0]
|
||||||
file_num += 1
|
file_num += 1
|
||||||
print(f"\n{'─' * 46}")
|
print(f"\n{'─' * 46}")
|
||||||
print(f"📄 File {file_num}/{total_files}: {title}")
|
print(f"📄 File {file_num}/{total_files}: {title}")
|
||||||
try:
|
try:
|
||||||
|
t_start = time.time()
|
||||||
segments, info = whisper_model.transcribe(
|
segments, info = whisper_model.transcribe(
|
||||||
file,
|
file,
|
||||||
language=language,
|
language=language,
|
||||||
beam_size=5
|
beam_size=5,
|
||||||
|
task="translate" if translate else "transcribe",
|
||||||
|
vad_filter=vad_filter,
|
||||||
|
word_timestamps=word_timestamps,
|
||||||
)
|
)
|
||||||
|
audio_duration = info.duration # seconds
|
||||||
# Make folder if missing
|
# Make folder if missing
|
||||||
os.makedirs('{}/transcriptions'.format(path), exist_ok=True)
|
os.makedirs('{}/transcriptions'.format(path), exist_ok=True)
|
||||||
# Stream segments as they are decoded
|
# Stream segments as they are decoded
|
||||||
segment_list = []
|
segment_list = []
|
||||||
with open("{}/transcriptions/{}.txt".format(path, title), 'w', encoding='utf-8') as f:
|
txt_path = "{}/transcriptions/{}.txt".format(path, title)
|
||||||
|
srt_path = "{}/transcriptions/{}.srt".format(path, title)
|
||||||
|
with open(txt_path, 'w', encoding='utf-8') as f, \
|
||||||
|
open(srt_path, 'w', encoding='utf-8') as srt_f:
|
||||||
f.write(title)
|
f.write(title)
|
||||||
for seg in segments:
|
f.write('\n' + '─' * 40 + '\n')
|
||||||
start_ts = str(datetime.timedelta(seconds=seg.start))
|
for idx, seg in enumerate(segments, start=1):
|
||||||
end_ts = str(datetime.timedelta(seconds=seg.end))
|
if stop_event and stop_event.is_set():
|
||||||
f.write('\n[{} --> {}]:{}'.format(start_ts, end_ts, seg.text))
|
break
|
||||||
|
text = seg.text.strip()
|
||||||
|
if timestamps:
|
||||||
|
start_ts = str(datetime.timedelta(seconds=seg.start))
|
||||||
|
end_ts = str(datetime.timedelta(seconds=seg.end))
|
||||||
|
f.write('\n[{} --> {}] {}'.format(start_ts, end_ts, text))
|
||||||
|
else:
|
||||||
|
f.write('\n{}'.format(text))
|
||||||
|
# Use word-level timestamps for SRT if available
|
||||||
|
if word_timestamps and hasattr(seg, 'words') and seg.words:
|
||||||
|
for w_idx, word in enumerate(seg.words, start=1):
|
||||||
|
w_text = word.word.strip()
|
||||||
|
if not w_text:
|
||||||
|
continue
|
||||||
|
w_start = _srt_timestamp(word.start)
|
||||||
|
w_end = _srt_timestamp(word.end)
|
||||||
|
srt_f.write(f'{idx}.{w_idx}\n{w_start} --> {w_end}\n{w_text}\n\n')
|
||||||
|
else:
|
||||||
|
srt_f.write(f'{idx}\n{_srt_timestamp(seg.start)} --> {_srt_timestamp(seg.end)}\n{text}\n\n')
|
||||||
f.flush()
|
f.flush()
|
||||||
|
srt_f.flush()
|
||||||
if verbose:
|
if verbose:
|
||||||
print(" [%.2fs → %.2fs] %s" % (seg.start, seg.end, seg.text))
|
print(" [%.2fs → %.2fs] %s" % (seg.start, seg.end, seg.text))
|
||||||
else:
|
else:
|
||||||
print(" Transcribed up to %.0fs..." % seg.end, end='\r')
|
print(" Transcribed up to %.0fs..." % seg.end, end='\r')
|
||||||
segment_list.append(seg)
|
segment_list.append(seg)
|
||||||
|
elapsed = time.time() - t_start
|
||||||
|
elapsed_min = elapsed / 60.0
|
||||||
|
audio_min = audio_duration / 60.0
|
||||||
|
ratio = audio_duration / elapsed if elapsed > 0 else float('inf')
|
||||||
print(f"✅ Done — saved to transcriptions/{title}.txt")
|
print(f"✅ Done — saved to transcriptions/{title}.txt")
|
||||||
|
print(f"⏱ Transcribed {audio_min:.1f} min of audio in {elapsed_min:.1f} min ({ratio:.1f}x realtime)")
|
||||||
files_transcripted.append(segment_list)
|
files_transcripted.append(segment_list)
|
||||||
except Exception:
|
except Exception as exc:
|
||||||
print('⚠ Not a valid audio/video file, skipping.')
|
print(f"⚠ Could not decode '{os.path.basename(file)}', skipping.")
|
||||||
|
print(f" Reason: {exc}")
|
||||||
|
|
||||||
# ── Summary ──────────────────────────────────────────────────────
|
# ── Summary ──────────────────────────────────────────────────────
|
||||||
print(f"\n{SEP}")
|
print(f"\n{SEP}")
|
||||||
@@ -147,3 +443,45 @@ def transcribe(path, glob_file, model=None, language=None, verbose=False):
|
|||||||
print(output_text)
|
print(output_text)
|
||||||
print(SEP)
|
print(SEP)
|
||||||
return output_text
|
return output_text
|
||||||
|
|
||||||
|
|
||||||
|
def _transcribe_worker_process(conn, path, glob_file, model, language, verbose, timestamps, vad_filter=False, word_timestamps=False, translate=False):
|
||||||
|
"""Child-process entry point for the UI's multiprocessing backend.
|
||||||
|
|
||||||
|
Redirects stdout/stderr → pipe connection so the main process can display
|
||||||
|
output in the console panel. The main process sends SIGTERM/SIGKILL to
|
||||||
|
stop this process immediately, including any in-progress download or inference.
|
||||||
|
"""
|
||||||
|
import sys
|
||||||
|
|
||||||
|
class _PipeWriter:
|
||||||
|
def __init__(self, c):
|
||||||
|
self.c = c
|
||||||
|
|
||||||
|
def write(self, text):
|
||||||
|
if text:
|
||||||
|
try:
|
||||||
|
self.c.send(text)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def flush(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
writer = _PipeWriter(conn)
|
||||||
|
sys.stdout = writer
|
||||||
|
sys.stderr = writer
|
||||||
|
|
||||||
|
result = '⚠ No output produced.'
|
||||||
|
try:
|
||||||
|
result = transcribe(path, glob_file, model, language, verbose, timestamps,
|
||||||
|
vad_filter=vad_filter, word_timestamps=word_timestamps,
|
||||||
|
translate=translate)
|
||||||
|
except Exception as exc:
|
||||||
|
result = f'⚠ Unexpected error: {exc}'
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
conn.send(('__done__', result))
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
conn.close()
|
||||||
|
|||||||
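The diff adds `_srt_timestamp` and writes word-level cues with identifiers of the form `{idx}.{w_idx}`. SubRip files conventionally number cues with sequential integers, so some players may not accept dotted identifiers. The sketch below shows the same timestamp arithmetic together with a single running counter as one possible alternative. `build_srt` and the `(start, end, text)` tuple shape are hypothetical and not part of the change.

```python
def srt_timestamp(seconds):
    """Convert seconds (float) to the SRT form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(cues):
    """Render (start, end, text) tuples as SRT blocks numbered 1, 2, 3, ...

    Empty cues are skipped without consuming a number, so the sequence
    stays contiguous whether the cues are segments or individual words.
    """
    blocks = []
    counter = 0
    for start, end, text in cues:
        text = text.strip()
        if not text:
            continue
        counter += 1
        blocks.append(
            f"{counter}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n"
        )
    return "".join(blocks)


if __name__ == "__main__":
    words = [(0.0, 0.4, " That's"), (0.4, 0.4, " "), (0.4, 0.9, " one")]
    print(build_srt(words), end="")
```

Rounding to whole milliseconds before the `divmod` chain avoids carrying a value such as 59.9996 s into an invalid `,1000` field.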