Video-to-Vocabulary Workflow - How To Guide

Step-by-step guide for extracting vocabulary from videos using AI

🎯 Quick Overview

This guide shows you how to automatically extract difficult English vocabulary from videos (YouTube, courses, etc.) and create dictionary entries with Urdu translations.


📋 Prerequisites

1. Install Required Tools

# Create Python virtual environment
cd scripts/dictionary
python3 -m venv .venv
source .venv/bin/activate  # On Linux/Mac
# .venv\Scripts\activate  # On Windows

# Install dependencies
pip install youtube-transcript-api openai pyyaml nltk wordfreq

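2. Get API Keys

  • OpenAI API key: required, stored as OPENAI_API_KEY
  • YouTube API key: optional, stored as YOUTUBE_API_KEY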

3. Configure Environment

# Create .env file in scripts/dictionary/ (run from your blog root directory)
cat > scripts/dictionary/.env << 'EOF'
OPENAI_API_KEY=your_key_here
# Optional
YOUTUBE_API_KEY=your_key_here
EOF
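Before running anything that calls the API, it can help to confirm that the keys are actually readable. The following is a minimal sketch using only the standard library. It assumes plain KEY=value lines, and the names `parse_env` and `missing_keys` are hypothetical; how the extraction script itself loads its configuration is not specified in this guide.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and full-line comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value  # Optional".
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(env: dict, required=("OPENAI_API_KEY",)) -> list:
    """Return required keys that are absent or still set to the placeholder."""
    return [key for key in required if env.get(key, "") in ("", "your_key_here")]


if __name__ == "__main__":
    env_path = Path("scripts/dictionary/.env")  # path used in this guide
    env = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
    print("Missing keys:", missing_keys(env) or "none")
```

Run it from the blog root directory; an empty result means the required key has been filled in.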

🚀 Method 1: Using GitHub Copilot Chat (Simplest)

Step 1: Get Video Transcript

# In terminal or use the provided script
# (the package installs its command as youtube_transcript_api, with underscores)
youtube_transcript_api VIDEO_ID --format text > transcript.txt
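The command expects a bare video ID rather than a full URL. As a rough sketch, the ID can be pulled out of the common YouTube URL shapes with the standard library. The function name `extract_video_id` is hypothetical, and the set of URL shapes handled here is an assumption, not an exhaustive list.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Standard links: https://youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        # Embedded and Shorts links: /embed/<id>, /shorts/<id>
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1]
    return None


if __name__ == "__main__":
    print(extract_video_id("https://youtube.com/watch?v=dQw4w9WgXcQ"))
```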

Step 2: Use Copilot Prompt

I have a video transcript about [TOPIC]. Please:

1. Read this transcript and identify 20-30 difficult English words that would be challenging for an advanced learner
2. For each word provide:
   - The word
   - Part of speech
   - A simple English definition
   - The sentence from the transcript where it appears
   - An Urdu translation of the word
   - An Urdu translation of the example sentence
3. Format the output as YAML following this structure:

[Paste your YAML template here]

Transcript:
[Paste transcript here]
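If you reuse this prompt often, filling it in by hand gets tedious. The sketch below assembles a condensed version of the prompt from its parts. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the wording is shortened from the prompt above, so adjust it to match the text you actually use.

```python
PROMPT_TEMPLATE = """\
I have a video transcript about {topic}. Please:

1. Read this transcript and identify {min_words}-{max_words} difficult English words
2. For each word provide the word, part of speech, a simple English definition,
   the transcript sentence where it appears, and Urdu translations of both
3. Format the output as YAML following this structure:

{yaml_template}

Transcript:
{transcript}
"""


def build_prompt(topic, transcript, yaml_template, min_words=20, max_words=30):
    """Fill the prompt template; raise if the transcript is empty."""
    if not transcript.strip():
        raise ValueError("transcript is empty")
    return PROMPT_TEMPLATE.format(
        topic=topic,
        min_words=min_words,
        max_words=max_words,
        yaml_template=yaml_template.strip(),
        transcript=transcript.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("RAG", "Retrieval augments generation.", "- word: example"))
```

Braces inside the transcript are safe here because `str.format` does not re-parse substituted values.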

Step 3: Review & Save

  • Copy the generated YAML
  • Paste into data/dictionary/[topic]/vocabulary.yaml
  • Review and edit as needed

🤖 Method 2: Using Python Script (Automated)

Basic Usage

# From your blog root directory
python scripts/dictionary/extract_vocabulary.py \
  --video-url "https://youtube.com/watch?v=dQw4w9WgXcQ" \
  --topic "rag-course" \
  --output-dir data/dictionary/rag-course/

What It Does

  1. ✅ Fetches video transcript from YouTube
  2. ✅ Uses GPT-4 to identify difficult words
  3. ✅ Translates to Urdu using AI
  4. ✅ Generates example sentences
  5. ✅ Creates properly formatted YAML file
  6. ✅ Optionally creates Hugo content page

Advanced Options

# Control number of words
python scripts/dictionary/extract_vocabulary.py \
  --video-url "URL" \
  --max-words 30 \
  --difficulty-threshold 7

# Interactive review mode
python scripts/dictionary/extract_vocabulary.py \
  --video-url "URL" \
  --interactive

# Create complete Hugo page
python scripts/dictionary/extract_vocabulary.py \
  --video-url "URL" \
  --topic "my-course" \
  --create-hugo-page
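If you are writing or adapting the script yourself, the options shown above could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the actual source of extract_vocabulary.py: the default values, the 1-10 range check, and the fallback output directory derived from the topic are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags used in this guide (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Extract vocabulary from a video transcript.")
    parser.add_argument("--video-url", required=True, help="URL of the source video")
    parser.add_argument("--topic", default="untitled", help="kebab-case topic folder name")
    parser.add_argument("--output-dir", default=None, help="where to write vocabulary.yaml")
    parser.add_argument("--max-words", type=int, default=25, help="upper limit on extracted words")
    parser.add_argument(
        "--difficulty-threshold", type=int, default=7, choices=range(1, 11), metavar="1-10",
        help="higher values keep fewer, harder words",
    )
    parser.add_argument("--interactive", action="store_true", help="review each word before saving")
    parser.add_argument("--create-hugo-page", action="store_true", help="also write a Hugo content page")
    return parser


def resolve_output_dir(args: argparse.Namespace) -> str:
    """Fall back to data/dictionary/<topic>/ when --output-dir is omitted (assumed behavior)."""
    return args.output_dir or f"data/dictionary/{args.topic}/"


if __name__ == "__main__":
    demo = build_parser().parse_args(["--video-url", "URL", "--topic", "my-course"])
    print(resolve_output_dir(demo))
```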

📝 Method 3: Manual with AI Assistance

Step 1: Get Transcript

For YouTube:

# Install youtube-transcript-api
pip install youtube-transcript-api

# Get transcript (the installed command uses underscores)
youtube_transcript_api VIDEO_ID --format text > transcript.txt

For Other Videos:

  • Use Whisper (OpenAI’s speech-to-text)
  • Manual transcription
  • Video platform’s built-in transcript

Step 2: Create Prompt for ChatGPT/Claude

Use this prompt template:

I'm building a vocabulary list from a technical video about [TOPIC].

Task: Analyze this transcript and extract 25 challenging English words suitable for an advanced ESL learner (first language: Urdu).

For each word provide:
1. word: [the word]
2. part_of_speech: [noun/verb/adjective/etc]
3. urdu_meaning: [Urdu translation]
4. example_en: [sentence from transcript or a clear example]
5. example_ur: [Urdu translation of the example]
6. additional_example_ur: [optional: another Urdu example]

Format as valid YAML following this structure:

```yaml
- word: example
  part_of_speech: noun
  urdu_meaning: مثال
  example_en: This is an example sentence.
  example_ur: یہ ایک مثالی جملہ ہے۔
```

Selection criteria:

  • Difficulty level: 6-9 out of 10
  • Important for understanding the topic
  • Not common everyday words
  • Technical or domain-specific terms
  • Useful for academic/professional contexts

Transcript: [PASTE YOUR TRANSCRIPT HERE]


Step 3: Process Output

1. Copy the YAML output
2. Validate YAML syntax
3. Save to data/dictionary/[topic]/vocabulary.yaml
4. Create corresponding Hugo page
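Syntax validation only confirms that the file parses; it does not confirm that every entry has the fields requested in the Step 2 prompt. A minimal sketch of a structural check is shown below. It operates on already-parsed data, and the function name `validate_entries` is hypothetical.

```python
REQUIRED_FIELDS = ("word", "part_of_speech", "urdu_meaning", "example_en", "example_ur")


def validate_entries(entries) -> list:
    """Return human-readable problems; an empty list means the data looks fine."""
    if not isinstance(entries, list):
        return ["top level must be a list of entries"]
    problems = []
    for index, entry in enumerate(entries, start=1):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a mapping, got {type(entry).__name__}")
            continue
        label = entry.get("word") or f"entry {index}"
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            # Every required field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing or empty '{field}'")
    return problems


if __name__ == "__main__":
    sample = [{"word": "corpus", "part_of_speech": "noun"}]
    for problem in validate_entries(sample):
        print(problem)
```

With PyYAML installed, the parsed data can come from `yaml.safe_load(open("vocabulary.yaml", encoding="utf-8"))`.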

🛠️ Workflow Integration

Complete End-to-End Process

```bash
# 1. Extract vocabulary
python scripts/dictionary/extract_vocabulary.py \
  --video-url "https://youtube.com/watch?v=VIDEO_ID" \
  --topic "my-new-topic" \
  --create-hugo-page

# 2. Review generated files
code data/dictionary/my-new-topic/vocabulary.yaml
code content/docs/dictionary/my-new-topic/index.md

# 3. Edit as needed
# Make manual corrections, add notes, etc.

# 4. Test locally
npm run dev:memory

# 5. View in browser
# Go to: http://localhost:1313/docs/dictionary/my-new-topic/

# 6. Commit when satisfied
git add data/dictionary/my-new-topic/
git add content/docs/dictionary/my-new-topic/
git commit -m "Add vocabulary from [video title]"
```

💡 Best Practices

1. Quality Control

  • Always review AI-generated translations: GPT-4 is good but not perfect for Urdu
  • Check example sentences: Ensure they’re contextual and clear
  • Verify Urdu script: Make sure diacritics are correct if needed
  • Test pronunciation: Read Urdu aloud to check naturalness
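Part of the script check can be automated. The sketch below flags Urdu fields that contain no Arabic-script letters or that still contain Latin letters, which often indicates an untranslated or half-translated value. The function names are hypothetical, and the check cannot tell Urdu from Arabic or Persian, nor judge translation quality, so treat hits as candidates for manual review.

```python
import unicodedata


def script_report(text: str) -> dict:
    """Count Arabic-script and Latin letters in the text."""
    counts = {"arabic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, and combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            counts["arabic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    return counts


def looks_like_urdu(text: str) -> bool:
    """True when the text has Arabic-script letters and no Latin letters."""
    counts = script_report(text)
    return counts["arabic"] > 0 and counts["latin"] == 0


if __name__ == "__main__":
    for sample in ("مثال", "example", "API مثال"):
        print(repr(sample), looks_like_urdu(sample))
```

Technical sentences may legitimately keep Latin acronyms such as API, so a flagged example sentence is not necessarily wrong.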

2. Word Selection

  • Focus on utility: Choose words you’ll actually use
  • Balance difficulty: Mix challenging words with slightly easier ones
  • Context matters: Words should relate to your learning goals
  • Avoid redundancy: Check existing vocabulary to prevent duplicates
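The duplicate check in the last point can be done mechanically when merging new entries into an existing list. This is a minimal sketch that compares words case-insensitively. The function name `merge_without_duplicates` is hypothetical, and it assumes every entry is a mapping with a `word` key, as in the YAML structure used in this guide.

```python
def merge_without_duplicates(existing, new_entries):
    """Append entries whose word is not already present (case-insensitive).

    Returns the merged list and the list of words that were skipped.
    """
    seen = {entry["word"].strip().lower() for entry in existing}
    merged = list(existing)
    skipped = []
    for entry in new_entries:
        key = entry["word"].strip().lower()
        if key in seen:
            skipped.append(entry["word"])
            continue
        seen.add(key)
        merged.append(entry)
    return merged, skipped


if __name__ == "__main__":
    merged, skipped = merge_without_duplicates(
        [{"word": "Corpus"}],
        [{"word": "corpus"}, {"word": "embedding"}],
    )
    print([entry["word"] for entry in merged], skipped)
```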

3. Organization

  • One topic per video/course: Keep sources separate
  • Consistent naming: Use kebab-case for folder names
  • Add metadata: Always include source information in frontmatter
  • Tag appropriately: Use relevant tags for discoverability
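For the naming convention, a small helper can turn a video or course title into a kebab-case folder name. This is a sketch under the assumption that folder names should be ASCII only; the function name `to_kebab_case` is hypothetical.

```python
import re
import unicodedata


def to_kebab_case(name: str) -> str:
    """Convert a title to a lowercase, hyphen-separated folder name."""
    # Decompose accented characters and drop anything outside ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Insert a hyphen at camelCase boundaries, e.g. "myCourse" becomes "my-Course".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", ascii_name)
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", spaced.lower()).strip("-")


if __name__ == "__main__":
    print(to_kebab_case("Army and War"))
```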

4. Maintenance

  • Regular review: Go through vocabulary weekly
  • Update translations: Improve Urdu translations as you learn
  • Add notes: Include personal mnemonics or usage notes
  • Link related words: Cross-reference similar terms

📊 Tracking Your Progress

Create a Learning Log

Add to your topic’s index.md:

## Learning Statistics

- **Date Started**: 2026-05-05
- **Total Words**: 45
- **Videos Processed**: 3
- **Mastery Level**: 60%
- **Review Date**: 2026-05-12

## Source Videos

1. [Video Title](URL) - 15 words
2. [Video Title](URL) - 18 words
3. [Video Title](URL) - 12 words
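The totals in this log are easy to get out of sync when you add videos. One option is to generate the block from a list of sources, as in the sketch below. The function name `render_learning_log` is hypothetical, and computing the review date as a fixed number of days after the start date is an assumption chosen to match the weekly review suggested above.

```python
from datetime import date, timedelta


def render_learning_log(sources, started, mastery_percent, review_after_days=7):
    """Render the log as Markdown from (title, url, word_count) tuples."""
    total = sum(count for _, _, count in sources)
    review = started + timedelta(days=review_after_days)
    lines = [
        "## Learning Statistics",
        "",
        f"- **Date Started**: {started.isoformat()}",
        f"- **Total Words**: {total}",
        f"- **Videos Processed**: {len(sources)}",
        f"- **Mastery Level**: {mastery_percent}%",
        f"- **Review Date**: {review.isoformat()}",
        "",
        "## Source Videos",
        "",
    ]
    lines += [
        f"{number}. [{title}]({url}) - {count} words"
        for number, (title, url, count) in enumerate(sources, start=1)
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [("Video Title", "URL", 15), ("Video Title", "URL", 18), ("Video Title", "URL", 12)]
    print(render_learning_log(demo, date(2026, 5, 5), 60))
```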

Integration with Anki

Since you have an Anki system, consider:

# Convert dictionary to Anki deck
python scripts/dictionary/export_to_anki.py \
  --input data/dictionary/my-topic/ \
  --output anki-decks/vocabulary-my-topic.apkg
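If the export script is not available, a simpler route is a tab-separated text file, which Anki can import directly. The sketch below is not export_to_anki.py and does not produce an .apkg file: it swaps in plain TSV output, one row per entry, with columns you map to note fields during import. The function name `entries_to_anki_tsv` and the column order are assumptions.

```python
import csv
import io

COLUMNS = ("word", "part_of_speech", "urdu_meaning", "example_en", "example_ur")


def entries_to_anki_tsv(entries) -> str:
    """Render entries as tab-separated rows, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for entry in entries:
        # Missing optional values become empty columns.
        writer.writerow([entry.get(column, "") for column in COLUMNS])
    return buffer.getvalue()


if __name__ == "__main__":
    demo = [{
        "word": "example",
        "part_of_speech": "noun",
        "urdu_meaning": "مثال",
        "example_en": "This is an example sentence.",
        "example_ur": "یہ ایک مثالی جملہ ہے۔",
    }]
    print(entries_to_anki_tsv(demo), end="")
```

When saving the result, write the file as UTF-8 so the Urdu text survives the import.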

🚨 Troubleshooting

Issue: Can’t Get YouTube Transcript

Solution 1: Check if captions are available

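from youtube_transcript_api import YouTubeTranscriptApi
# Prints the caption tracks available for the video.
# On 1.x releases of the library, use YouTubeTranscriptApi().list('VIDEO_ID') instead.
print(YouTubeTranscriptApi.list_transcripts('VIDEO_ID'))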

Solution 2: Use Whisper for local transcription

# Whisper requires ffmpeg to be installed and on your PATH
pip install openai-whisper
whisper video.mp4 --language en --model medium

Issue: Poor Urdu Translations

Solution: Use a specialized prompt

You are a professional Urdu translator. Translate this English word and sentence to natural, modern Urdu.

Word: [word]
Sentence: [sentence]

Guidelines:
- Use contemporary Urdu vocabulary
- Maintain technical accuracy
- Ensure natural sentence flow
- Include proper Urdu grammar

Issue: Too Many/Few Words Extracted

Solution: Adjust the difficulty threshold or the word limit, either with the script flags or in your prompt

--difficulty-threshold 7  # Higher = fewer, harder words
--max-words 20            # Explicit limit
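To see how the two settings interact, here is a minimal sketch of the selection step. It assumes each candidate carries a hypothetical `difficulty` score on the 1-10 scale and that the threshold is inclusive; the real script may score and filter words differently.

```python
def select_words(candidates, difficulty_threshold=7, max_words=20):
    """Keep candidates at or above the threshold, hardest first, capped at max_words."""
    kept = [c for c in candidates if c["difficulty"] >= difficulty_threshold]
    # Sort by descending difficulty, then alphabetically for a stable order.
    kept.sort(key=lambda c: (-c["difficulty"], c["word"]))
    return kept[:max_words]


if __name__ == "__main__":
    demo = [
        {"word": "the", "difficulty": 1},
        {"word": "corpus", "difficulty": 7},
        {"word": "ontology", "difficulty": 9},
        {"word": "latent", "difficulty": 8},
    ]
    print([c["word"] for c in select_words(demo, difficulty_threshold=8)])
```

Raising the threshold shrinks the pool before the cap applies, so a high threshold with a high cap can still return only a handful of words.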

Issue: YAML Syntax Errors

Solution: Validate YAML before saving

python -c "import yaml; yaml.safe_load(open('vocabulary.yaml', encoding='utf-8'))"

Or use an online validator: https://www.yamllint.com/


📚 Example Workflows

Workflow A: YouTube Course (10 videos)

# Process entire playlist (playlist_ids.txt holds one video ID per line)
for video_id in $(cat playlist_ids.txt); do
  python scripts/dictionary/extract_vocabulary.py \
    --video-url "https://youtube.com/watch?v=$video_id" \
    --topic "my-course" \
    --append  # Add to existing file
done

# Review all at once
code data/dictionary/my-course/vocabulary.yaml

Workflow B: Single Documentary

# High-quality processing
python scripts/dictionary/extract_vocabulary.py \
  --video-url "URL" \
  --topic "documentary-name" \
  --difficulty-threshold 8 \
  --max-words 25 \
  --interactive \
  --create-hugo-page

Workflow C: Conference Talk

# Focus on technical terms
python scripts/dictionary/extract_vocabulary.py \
  --video-url "URL" \
  --topic "conference-talk" \
  --focus-technical \
  --include-acronyms

🎯 Next Steps

  1. Choose your method: Script, Copilot, or manual
  2. Try with one video: Start small
  3. Review quality: Check translations and examples
  4. Iterate: Improve prompts/scripts based on results
  5. Scale up: Process more videos efficiently
  6. Share: Consider sharing your vocabulary lists

📞 Need Help?

  • Check the main documentation: complete-guide.md
  • Review examples: data/dictionary/army-and-war/
  • Ask Copilot for clarification
  • Test incrementally and iterate

Last Updated: 2026-05-05
Status: Ready to Use