Step-by-step guide for extracting vocabulary from videos using AI
This guide shows you how to automatically extract difficult English vocabulary from videos (YouTube, courses, etc.) and create dictionary entries with Urdu translations.

    # Create Python virtual environment
    cd scripts/dictionary
    python3 -m venv .venv
    source .venv/bin/activate  # On Linux/Mac
    # .venv\Scripts\activate   # On Windows

    # Install dependencies
    pip install youtube-transcript-api openai pyyaml nltk wordfreq

    # Create .env file in scripts/dictionary/ (run this from the blog root)
    cat > scripts/dictionary/.env << EOF
    OPENAI_API_KEY=your_key_here
    YOUTUBE_API_KEY=your_key_here # Optional
    EOF
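How `extract_vocabulary.py` reads this file is not shown in this guide, and `python-dotenv` is not among the installed dependencies. The following is only a minimal sketch of one way a script could load the keys with the standard library; the function name `parse_env_text` is hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments (hypothetical helper)."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore empty lines, full-line comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "# Optional".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value.strip("'\"")
    return values


if __name__ == "__main__":
    sample = "OPENAI_API_KEY=your_key_here\nYOUTUBE_API_KEY=your_key_here # Optional\n"
    print(parse_env_text(sample))
```

A script would typically pass the contents of `scripts/dictionary/.env` to this function and fail early if `OPENAI_API_KEY` is missing.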

    # In terminal or use the provided script
    # (the pip package is youtube-transcript-api; the installed command uses underscores)
    youtube_transcript_api VIDEO_ID --format text > transcript.txt
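The command above needs the bare `VIDEO_ID`, while the rest of this guide passes full URLs. As a rough sketch, assuming only the two most common URL shapes, the ID could be pulled out like this; `extract_video_id` is a hypothetical helper and not part of the provided scripts.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the video ID from common YouTube URL shapes (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    elif host in {"youtube.com", "m.youtube.com"}:
        # Watch links carry the ID in the query string: ...?v=<id>
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"Could not find a video ID in {url!r}")
    return video_id


if __name__ == "__main__":
    print(extract_video_id("https://youtube.com/watch?v=dQw4w9WgXcQ"))
```

Other URL forms (embeds, shorts, playlists) are deliberately not handled here and would raise `ValueError`.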
I have a video transcript about [TOPIC]. Please:
1. Read this transcript and identify 20-30 difficult English words that would be challenging for an advanced learner
2. For each word provide:
   - The word
   - Part of speech
   - A simple English definition
   - The sentence from the transcript where it appears
   - An Urdu translation of the word
   - An Urdu translation of the example sentence
3. Format the output as YAML following this structure:
[Paste your YAML template here]
Transcript:
[Paste transcript here]
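If you fill in this prompt often, the three placeholders can be substituted programmatically. The sketch below condenses the wording of the prompt above into a template string; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, not part of the provided scripts.

```python
PROMPT_TEMPLATE = """I have a video transcript about {topic}. Please:
1. Read this transcript and identify 20-30 difficult English words that would be challenging for an advanced learner
2. For each word provide the word, part of speech, a simple English definition, the transcript sentence, and Urdu translations of the word and the sentence
3. Format the output as YAML following this structure:
{yaml_template}
Transcript:
{transcript}"""


def build_prompt(topic: str, yaml_template: str, transcript: str) -> str:
    """Fill the three placeholders (hypothetical helper)."""
    # str.format only interprets braces in the template itself, so braces
    # inside the transcript or the YAML template are passed through untouched.
    return PROMPT_TEMPLATE.format(
        topic=topic.strip(),
        yaml_template=yaml_template.strip(),
        transcript=transcript.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("RAG", "- word: example", "Sample transcript text."))
```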
data/dictionary/[topic]/vocabulary.yaml

    # From your blog root directory
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "https://youtube.com/watch?v=dQw4w9WgXcQ" \
      --topic "rag-course" \
      --output-dir data/dictionary/rag-course/

    # Control number of words
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "URL" \
      --max-words 30 \
      --difficulty-threshold 7

    # Interactive review mode
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "URL" \
      --interactive

    # Create complete Hugo page
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "URL" \
      --topic "my-course" \
      --create-hugo-page
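This guide does not say how `--difficulty-threshold` and `--max-words` are applied inside the script. One plausible approach, sketched here purely as an assumption, is to map word frequency onto a 1-10 difficulty scale and keep only the rarest words. The linear mapping and the names `difficulty_score` and `select_words` are hypothetical.

```python
from typing import Callable, Iterable


def difficulty_score(zipf: float) -> int:
    """Map a Zipf frequency (about 0 = unseen, 8 = very common) to a 1-10 difficulty."""
    clamped = max(0.0, min(8.0, zipf))
    return round(10 - clamped * 9 / 8)


def select_words(
    words: Iterable[str],
    zipf_lookup: Callable[[str], float],
    threshold: int = 7,
    max_words: int = 30,
) -> list[str]:
    """Keep distinct alphabetic words at or above the threshold, hardest first."""
    scored: dict[str, int] = {}
    for word in words:
        key = word.lower()
        if key.isalpha() and key not in scored:
            scored[key] = difficulty_score(zipf_lookup(key))
    hard = [w for w, score in scored.items() if score >= threshold]
    # Sort by descending difficulty, then alphabetically for stable output.
    hard.sort(key=lambda w: (-scored[w], w))
    return hard[:max_words]


if __name__ == "__main__":
    # Stub frequencies for illustration only; real values would come from a word list.
    stub = {"the": 7.7, "model": 5.0, "ubiquitous": 3.0, "obfuscate": 2.0}
    print(select_words(stub, lambda w: stub.get(w, 0.0)))
```

With `wordfreq` installed (it is in the dependency list above), the lookup could be `lambda w: zipf_frequency(w, "en")` after `from wordfreq import zipf_frequency`.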
For YouTube:

    # Install youtube-transcript-api
    pip install youtube-transcript-api

    # Get transcript (the installed command uses underscores)
    youtube_transcript_api VIDEO_ID --format text > transcript.txt
For Other Videos:
Use this prompt template:
I'm building a vocabulary list from a technical video about [TOPIC].
Task: Analyze this transcript and extract 25 challenging English words suitable for an advanced ESL learner (first language: Urdu).
For each word provide:
1. word: [the word]
2. part_of_speech: [noun/verb/adjective/etc]
3. urdu_meaning: [Urdu translation]
4. example_en: [sentence from transcript or a clear example]
5. example_ur: [Urdu translation of the example]
6. additional_example_ur: [optional: another Urdu example]
Format as valid YAML following this structure:
```yaml
- word: example
  part_of_speech: noun
  urdu_meaning: مثال
  example_en: This is an example sentence.
  example_ur: یہ ایک مثالی جملہ ہے۔
```
Selection criteria:
Transcript: [PASTE YOUR TRANSCRIPT HERE]
### Step 3: Process Output
1. Copy the YAML output
2. Validate YAML syntax
3. Save to `data/dictionary/[topic]/vocabulary.yaml`
4. Create corresponding Hugo page
---
## 🛠️ Workflow Integration
### Complete End-to-End Process
```bash
# 1. Extract vocabulary
python scripts/dictionary/extract_vocabulary.py \
  --video-url "https://youtube.com/watch?v=VIDEO_ID" \
  --topic "my-new-topic" \
  --create-hugo-page

# 2. Review generated files
code data/dictionary/my-new-topic/vocabulary.yaml
code content/docs/dictionary/my-new-topic/index.md

# 3. Edit as needed
# Make manual corrections, add notes, etc.

# 4. Test locally
npm run dev:memory

# 5. View in browser
# Go to: http://localhost:1313/docs/dictionary/my-new-topic/

# 6. Commit when satisfied
git add data/dictionary/my-new-topic/
git add content/docs/dictionary/my-new-topic/
git commit -m "Add vocabulary from [video title]"
```
Add to your topic’s index.md:

    ## Learning Statistics

    - **Date Started**: 2026-05-05
    - **Total Words**: 45
    - **Videos Processed**: 3
    - **Mastery Level**: 60%
    - **Review Date**: 2026-05-12

    ## Source Videos

    1. [Video Title](URL) - 15 words
    2. [Video Title](URL) - 18 words
    3. [Video Title](URL) - 12 words
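The totals in this template are easy to let drift out of sync with the video list. As a small sketch, the count lines and the source list could be generated from one list of videos; `render_stats` is a hypothetical helper, and the date and mastery fields are left for manual editing.

```python
def render_stats(videos: list[tuple[str, str, int]]) -> str:
    """Build the statistics section; videos are (title, url, word_count) tuples."""
    total = sum(count for _, _, count in videos)
    lines = [
        "## Learning Statistics",
        "",
        f"- **Total Words**: {total}",
        f"- **Videos Processed**: {len(videos)}",
        "",
        "## Source Videos",
        "",
    ]
    for number, (title, url, count) in enumerate(videos, start=1):
        lines.append(f"{number}. [{title}]({url}) - {count} words")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_stats([("Video Title", "URL", 15), ("Video Title", "URL", 18)]))
```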
Since you have an Anki system, consider:

    # Convert dictionary to Anki deck
    python scripts/dictionary/export_to_anki.py \
      --input data/dictionary/my-topic/ \
      --output anki-decks/vocabulary-my-topic.apkg
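The internals of `export_to_anki.py` are not shown here. If you only need a quick import without building an `.apkg` file, Anki can also import tab-separated text files. The sketch below assumes the entry fields used elsewhere in this guide (`word`, `urdu_meaning`, `example_en`, `example_ur`); `entries_to_anki_tsv` is a hypothetical helper.

```python
import csv
import io


def entries_to_anki_tsv(entries: list[dict]) -> str:
    """Render entries as two-column TSV: front = word, back = meaning and examples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for entry in entries:
        back = "<br>".join(
            [
                entry["urdu_meaning"],
                entry.get("example_en", ""),
                entry.get("example_ur", ""),
            ]
        )
        writer.writerow([entry["word"], back])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {
            "word": "example",
            "urdu_meaning": "مثال",
            "example_en": "This is an example sentence.",
            "example_ur": "یہ ایک مثالی جملہ ہے۔",
        }
    ]
    print(entries_to_anki_tsv(sample), end="")
```

Write the result to a UTF-8 file and enable HTML in fields during import so the `<br>` separators render as line breaks.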
Solution 1: Check if captions are available
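
    from youtube_transcript_api import YouTubeTranscriptApi
    # Prints the available caption tracks; releases 1.0+ use YouTubeTranscriptApi().list('VIDEO_ID') instead
    print(YouTubeTranscriptApi.list_transcripts('VIDEO_ID'))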
Solution 2: Use Whisper for local transcription

    pip install openai-whisper
    whisper video.mp4 --language en --model medium
Solution: Use a specialized prompt
You are a professional Urdu translator. Translate this English word and sentence to natural, modern Urdu.
Word: [word]
Sentence: [sentence]
Guidelines:
- Use contemporary Urdu vocabulary
- Maintain technical accuracy
- Ensure natural sentence flow
- Include proper Urdu grammar
Solution: Adjust difficulty threshold in script or prompt

    --difficulty-threshold 7  # Higher = fewer, harder words
    --max-words 20            # Explicit limit
Solution: Validate YAML before saving

    # Read as UTF-8 explicitly so the Urdu text parses on every platform
    python -c "import yaml; yaml.safe_load(open('vocabulary.yaml', encoding='utf-8'))"
Or use an online validator: https://www.yamllint.com/
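A syntax check does not catch entries that parse correctly but lack a field. As a minimal sketch, assuming the five fields from the YAML example in this guide are required, the parsed data could be checked like this; `REQUIRED_FIELDS` and `find_entry_problems` are hypothetical names.

```python
REQUIRED_FIELDS = ("word", "part_of_speech", "urdu_meaning", "example_en", "example_ur")


def find_entry_problems(entries: object) -> list[str]:
    """Return human-readable problems found in parsed vocabulary data."""
    if not isinstance(entries, list):
        return ["top level must be a list of entries"]
    problems: list[str] = []
    for index, entry in enumerate(entries, start=1):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: not a mapping")
            continue
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            # A field counts as missing if absent, non-string, or blank.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index}: missing or empty '{field}'")
    return problems


if __name__ == "__main__":
    print(find_entry_problems([{"word": "example", "part_of_speech": "noun"}]))
```

The input would be the result of `yaml.safe_load(...)` on the vocabulary file; an empty list back means every entry has all required fields.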

    # Process entire playlist
    for video_id in $(cat playlist_ids.txt); do
      python scripts/dictionary/extract_vocabulary.py \
        --video-url "https://youtube.com/watch?v=$video_id" \
        --topic "my-course" \
        --append  # Add to existing file
    done

    # Review all at once
    code data/dictionary/my-course/vocabulary.yaml
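When several videos from one course feed the same file, the same word is likely to be extracted more than once. Whether `--append` removes duplicates is not stated in this guide; the sketch below shows one way a merge step could do it, keeping the first occurrence of each word. `merge_entries` is a hypothetical helper.

```python
def merge_entries(existing: list[dict], new: list[dict]) -> list[dict]:
    """Append new entries, skipping words already present (case-insensitive)."""
    merged = list(existing)
    seen = {entry["word"].strip().lower() for entry in existing}
    for entry in new:
        key = entry["word"].strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged


if __name__ == "__main__":
    old = [{"word": "example"}]
    fresh = [{"word": "Example"}, {"word": "retrieval"}]
    print(merge_entries(old, fresh))
```

Keeping the earlier entry preserves any manual corrections you already made during review.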

    # High-quality processing
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "URL" \
      --topic "documentary-name" \
      --difficulty-threshold 8 \
      --max-words 25 \
      --interactive \
      --create-hugo-page

    # Focus on technical terms
    python scripts/dictionary/extract_vocabulary.py \
      --video-url "URL" \
      --topic "conference-talk" \
      --focus-technical \
      --include-acronyms
data/dictionary/army-and-war/

Last Updated: 2026-05-05

Status: Ready to Use