Step-by-Step Vocabulary Extraction Guide

Complete 7-step walkthrough for extracting vocabulary from YouTube videos. From transcript extraction through deployment, with detailed explanations and troubleshooting.

Dictionary Vocabulary Extraction - Step-by-Step Guide

🚨 IMPORTANT: Environment Setup

For AI Agents / Automated Commands

Before running any Python-based transcript extraction commands, you MUST set up the shell first:

  • Run the conda-bash alias so that python commands work in scripts.
  • Once the conda profile has been read, the base environment is active and conda commands are available; you can then activate the ags-dictionary environment.

  1. Load the conda shell profile:

    source ~/miniconda3/etc/profile.d/conda.sh

  2. Activate the Python environment:

    conda activate ags-dictionary

  3. Then run your commands:

    python -m youtube_transcript_api VIDEO_ID --format text
    

OR use the automated script (which handles this for you):

    ./scripts/dictionary/dict-step1-transcript.sh VIDEO_URL TOPIC_NAME

📋 Complete Workflow

Step 1: Extract Transcript (Automated) ⭐

Easy Way - Using Automated Script:

    # From project root
    ./scripts/dictionary/dict-step1-transcript.sh "https://www.youtube.com/watch?v=9M_dq_0ljsc" capitalism

What it does:

  • ✅ Automatically activates conda environment
  • ✅ Extracts transcript from YouTube
  • ✅ Saves to .prompts/dictionary/transcript/TOPIC-transcript.txt
  • ✅ Shows word count and confirmation
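The output location and word count listed above can also be reproduced by hand. The following is a minimal Python sketch, not the script's actual code; the helper names `transcript_path` and `count_words` are hypothetical, and only the path pattern comes from this guide.

```python
from pathlib import Path


def transcript_path(topic: str, root: str = ".") -> Path:
    """Build the transcript path used in this guide for a given topic.

    Hypothetical helper: the path pattern is taken from the guide,
    the function itself is not part of the project scripts.
    """
    return Path(root) / ".prompts" / "dictionary" / "transcript" / f"{topic}-transcript.txt"


def count_words(text: str) -> int:
    """Count whitespace-separated words, similar to `wc -w`."""
    return len(text.split())


if __name__ == "__main__":
    path = transcript_path("capitalism")
    print(path.as_posix())
    if path.is_file():
        print(count_words(path.read_text(encoding="utf-8")), "words")
```

Run from the project root, this prints the expected transcript path and, if the file already exists, its word count.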

Manual Way (if you prefer):

    # 1. Activate conda
    source ~/miniconda3/etc/profile.d/conda.sh
    conda activate ags-dictionary

    # 2. Extract transcript
    python -m youtube_transcript_api VIDEO_ID --format text > .prompts/dictionary/transcript/TOPIC-transcript.txt

    # 3. Check word count
    wc -w .prompts/dictionary/transcript/TOPIC-transcript.txt
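The manual command expects a bare VIDEO_ID, while the automated script accepts a full URL. The following is a minimal sketch of pulling the ID out of a URL with the Python standard library. It assumes the common `watch?v=` and `youtu.be` URL shapes, and the function name `extract_video_id` is hypothetical (it is not taken from the project script).

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the video ID from a YouTube URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch links carry the ID in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"Could not find a video ID in: {url}")
    return video_id


if __name__ == "__main__":
    print(extract_video_id("https://www.youtube.com/watch?v=9M_dq_0ljsc"))
```

Other URL shapes (embeds, shorts, playlists) are not handled here and would raise `ValueError`.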

Step 2: Extract Vocabulary with Copilot

Now that you have the transcript, use GitHub Copilot to extract vocabulary.

Option A: In VS Code Copilot Chat

  1. Open Copilot Chat (Ctrl+Shift+I, or Cmd+Shift+I on macOS)

  2. Use this prompt (copy-paste ready):

    I need help extracting vocabulary from a video transcript for my English learning dictionary.

    **Task**: Analyze this video transcript about "capitalism and economic systems" and extract 25 challenging English words suitable for an advanced ESL learner (first language: Urdu).

    **Selection Criteria**:
    - Difficulty level: 6-9 out of 10
    - Important for understanding economics/politics/systems
    - Not common everyday words
    - Academic or domain-specific terms
    - Useful for intellectual discussions

    **For each word, provide**:
    1. word: the English word (lowercase)
    2. part_of_speech: (noun, verb, adjective, adverb, phrase, technical-term, etc.)
    3. urdu_meaning: Urdu translation in Urdu script
    4. example_en: A clear example sentence from the transcript or similar context
    5. example_ur: Natural, contextual Urdu translation of the example
    6. additional_example_ur: (optional) Another Urdu example showing different usage

    **Output Format**: Valid YAML array only, no markdown formatting, no explanations.

    **Example Entry**:
    ```yaml
    - word: accumulate
      part_of_speech: verb
      urdu_meaning: جمع کرنا، اکٹھا کرنا
      example_en: Capitalism encourages us to accumulate wealth and resources.
      example_ur: سرمایہ داری ہمیں دولت اور وسائل جمع کرنے کی ترغیب دیتی ہے۔
      additional_example_ur: وقت کے ساتھ ساتھ دولت جمع ہوتی جاتی ہے۔
    ```

    Transcript:

    [OPEN FILE: .prompts/dictionary/transcript/capitalism-transcript.txt AND PASTE CONTENT HERE]

    Please extract the vocabulary now in YAML format.

  3. Copilot will generate YAML; copy the output.

Option B: In ChatGPT (GPT-4o/4.1)

  1. Go to https://chat.openai.com
  2. Paste the same prompt above
  3. Replace [OPEN FILE...] with the actual transcript content
  4. Copy the YAML output
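Both options involve swapping the placeholder for the transcript text by hand. If you happen to keep the prompt in a text file (an assumption; the guide only shows copy-paste), a sketch like the following could do the substitution. The function name `fill_prompt` is hypothetical.

```python
# Placeholder text exactly as it appears in the prompt in this guide.
PLACEHOLDER = (
    "[OPEN FILE: .prompts/dictionary/transcript/capitalism-transcript.txt "
    "AND PASTE CONTENT HERE]"
)


def fill_prompt(prompt_template: str, transcript: str) -> str:
    """Replace the placeholder in the prompt with the transcript text.

    Hypothetical helper; raises ValueError if the placeholder is absent
    so that a prompt is never sent without its transcript.
    """
    if PLACEHOLDER not in prompt_template:
        raise ValueError("Placeholder not found in prompt template")
    return prompt_template.replace(PLACEHOLDER, transcript.strip())
```

The returned string can then be pasted into Copilot Chat or ChatGPT as a single message.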

Step 3: Save the YAML

Create the data directory and file:

    # Create directory
    mkdir -p data/dictionary/capitalism

    # Save YAML (paste Copilot output)
    cat > data/dictionary/capitalism/vocabulary.yaml
    # Paste the YAML here, then press Ctrl+D

Or use VS Code:

  1. Create file: data/dictionary/capitalism/vocabulary.yaml
  2. Paste the YAML from Copilot
  3. Save (Ctrl+S)

Step 4: Validate YAML

    # Check YAML is valid (the success message prints only after the file parses)
    python -c 'import yaml; data = yaml.safe_load(open("data/dictionary/capitalism/vocabulary.yaml", encoding="utf-8")); print("✅ Valid YAML!"); print(f"Entries: {len(data)}")'
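Parsing alone does not confirm that every entry carries the fields requested in the Step 2 prompt. The following is a minimal sketch of a field check. The field names come from that prompt, `additional_example_ur` is treated as optional, and the function names `find_problems` and `check_file` are hypothetical.

```python
REQUIRED_FIELDS = ("word", "part_of_speech", "urdu_meaning", "example_en", "example_ur")


def find_problems(entries):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(entries, list):
        return ["Top level must be a YAML list of entries"]
    problems = []
    for index, entry in enumerate(entries, start=1):
        if not isinstance(entry, dict):
            problems.append(f"Entry {index}: not a mapping")
            continue
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            # Non-string values (for example a bare `no` parsed as False)
            # are reported as well, since every field should be text.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"Entry {index}: missing or empty '{field}'")
    return problems


def check_file(path):
    """Load a vocabulary file with PyYAML and print any problems found."""
    import yaml  # PyYAML, installed in the ags-dictionary environment

    with open(path, encoding="utf-8") as handle:
        problems = find_problems(yaml.safe_load(handle))
    for problem in problems:
        print(problem)
    return not problems
```

Calling `check_file("data/dictionary/capitalism/vocabulary.yaml")` from the project root prints one line per problem and returns `True` when none are found.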

Step 5: Create Hugo Content Page

    # Create page from template
    hugo new content/docs/dictionary/capitalism/index.md --kind dictionary

Then edit the file to:

  1. Update title: "Capitalism and Economic Systems Vocabulary"
  2. Update description
  3. Update shortcode reference: {{< vocabulary-accordion "dictionary.capitalism.vocabulary" >}}
  4. Add source info
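The shortcode argument mirrors the location of the file saved in Step 3: Hugo exposes files under data/ through dotted keys that follow the directory path. The sketch below shows that mapping. The function name `data_key_to_path` is hypothetical, and how the vocabulary-accordion shortcode resolves the key internally is an assumption.

```python
def data_key_to_path(key: str, extension: str = "yaml") -> str:
    """Map a dotted Hugo data key to the file it refers to under data/.

    Hypothetical helper for checking that the shortcode argument and
    the saved YAML file agree.
    """
    parts = key.split(".")
    if not all(parts):
        raise ValueError(f"Malformed data key: {key!r}")
    return "data/" + "/".join(parts) + "." + extension


if __name__ == "__main__":
    print(data_key_to_path("dictionary.capitalism.vocabulary"))
```

If the printed path does not match where the YAML was saved, the accordion will most likely render empty.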

Step 6: Test Locally

    # Start dev server
    npm run dev:memory

    # Open browser to:
    # http://localhost:1313/docs/dictionary/capitalism/

Step 7: Review and Commit

  1. ✅ Check words are appropriate difficulty
  2. ✅ Review Urdu translations (edit if needed)
  3. ✅ Verify accordion works
  4. ✅ Commit to git:
    git add -A
    git commit -m "Add capitalism vocabulary"

🔧 Automated Scripts Available

Scripts in Project Root

  1. scripts/dictionary/dict-step1-transcript.sh ⭐ NEW

    • Automates transcript extraction
    • Handles conda environment automatically
    • Usage: ./scripts/dictionary/dict-step1-transcript.sh VIDEO_URL TOPIC_NAME
  2. scripts/dictionary/dict-extract.sh

    • Full automation with OpenAI API (requires API key)
    • Usage: ./scripts/dictionary/dict-extract.sh --video-url "URL" --topic "topic"

📚 Quick Reference

Example: Complete Flow

    # Step 1: Extract transcript (automated)
    ./scripts/dictionary/dict-step1-transcript.sh "https://www.youtube.com/watch?v=9M_dq_0ljsc" capitalism

    # Step 2: Open transcript file
    code .prompts/dictionary/transcript/capitalism-transcript.txt

    # Step 3: Use Copilot Chat to extract vocabulary (see prompt above)

    # Step 4: Save YAML
    mkdir -p data/dictionary/capitalism
    cat > data/dictionary/capitalism/vocabulary.yaml
    # Paste YAML, Ctrl+D

    # Step 5: Create Hugo page
    hugo new content/docs/dictionary/capitalism/index.md --kind dictionary

    # Step 6: Edit Hugo page to reference "dictionary.capitalism.vocabulary"

    # Step 7: Test
    npm run dev:memory

    # Step 8: Commit
    git add data/dictionary/capitalism/ content/docs/dictionary/capitalism/
    git commit -m "Add capitalism vocabulary"

Total time: 15-20 minutes
Cost: $0 (using Copilot)


🐛 Troubleshooting

Error: “conda: command not found”

Solution: Conda path might be different. Try:

    source ~/anaconda3/etc/profile.d/conda.sh  # If using Anaconda
    # or
    source ~/opt/miniconda3/etc/profile.d/conda.sh  # Alternative location
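To find out which of these locations exists on a given machine, a sketch like the following can help. It checks only the three paths mentioned in this guide, and the function name `find_conda_profile` is hypothetical.

```python
from pathlib import Path

# Locations mentioned in this guide; extend the list for other installs.
CANDIDATES = (
    "~/miniconda3/etc/profile.d/conda.sh",
    "~/anaconda3/etc/profile.d/conda.sh",
    "~/opt/miniconda3/etc/profile.d/conda.sh",
)


def find_conda_profile(candidates=CANDIDATES):
    """Return the first existing conda.sh path, or None if none exist."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_conda_profile()
    print(f"source {found}" if found else "conda.sh not found in the usual locations")
```

When a path is found, the printed `source` line can be pasted into the shell before running `conda activate ags-dictionary`.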

Error: “ags-dictionary environment not found”

Solution: Create the environment:

    conda create -n ags-dictionary python=3.11
    conda activate ags-dictionary
    pip install youtube-transcript-api PyYAML openai python-dotenv

Error: “Could not extract transcript”

Reasons:

  • Video doesn’t have captions (try a different video)
  • Video ID is wrong (check URL)
  • Private or restricted video

💡 Tips

For Better Vocabulary Selection

In your Copilot prompt, you can customize:

    Focus on words related to:
    - Economic systems
    - Political theory
    - Social structures
    - Academic discourse

    Difficulty level: 7-9 (very advanced)
    Word count: 30 (more words)

For Better Urdu Translations

Ask Copilot to:

    Use formal Urdu for academic terms
    Use contemporary vocabulary
    Avoid overly archaic expressions
    Include context in examples

📞 Need Help?

  • Detailed prompts: See
  • Full guide: See
  • Quick commands: See

Created: 2026-05-05
Status: Production Ready
Recommended: Use Step 1 script + Copilot for best experience