A comprehensive data-driven analysis of Super Bowl commercials to develop a strategic advertising approach for Forge & Field's new Rogue Ridge personal care product line.
- Client: Forge & Field Brands
- Product: Rogue Ridge (Personal Care Line for American Men)
- Investment: $11M total advertising budget
- Key Objective: Develop a high-impact 30-second Super Bowl commercial
- Last Updated: June 2025
- Current Phase: Deliverable 2 - Model Training and Preliminary Insights
- YouTube: 1,181 Super Bowl ad videos (2000-2025)
  - 470 with metadata and comments (~50,000 comments)
- Reddit: ~10,000 posts/comments (2020-2024)
- News Articles: ~500 articles linked from Reddit
- Video Files: 1,181 downloaded MP4s
- Multimodal Content per Ad:
  - Audio (MP3), Subtitle (TXT), Keyframes (JPG)
- Video Metadata: >95% complete
- Reddit & News: Supplemented for missing YouTube discussions
- Comment Richness: Multi-source, sentiment-scored
- extract_youtube_id_list.py: Scrape YouTube IDs from superbowl-ads.com
- youtube_info.py: Fetch metadata & comments
- reddit_updata.py: Search Reddit discussions using PRAW
- superbowldownload.py: MP4 download fallback for non-YouTube videos
- whisper_audio_process.py: Transcribe audio using Whisper
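
The ID-scraping step listed above can be illustrated with a small helper. This is a minimal sketch, not the code in extract_youtube_id_list.py: the function name `extract_youtube_ids` and the URL patterns it recognizes are assumptions, and it expects the page HTML to be fetched already.

```python
import re
from typing import List

# Matches the 11-character video ID in common YouTube URL shapes
# (watch?v=..., embed/..., youtu.be/...). The pattern is an assumption
# about which link styles appear on the source pages.
YOUTUBE_ID_RE = re.compile(
    r"(?:youtube(?:-nocookie)?\.com/(?:watch\?v=|embed/)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)


def extract_youtube_ids(html: str) -> List[str]:
    """Return unique video IDs found in html, in first-seen order."""
    seen = {}
    for match in YOUTUBE_ID_RE.finditer(html):
        # dict keys keep insertion order, so duplicates are dropped
        # without reordering the results.
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Keeping the IDs in first-seen order makes the output reproducible between runs, which helps when the list is later joined against downloaded files.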
- Preprocessing:
  - Clean comments, subtitles, descriptions
  - Remove emojis, filler text, duplicates
- Sentiment Classification:
  - Models: TextBlob, VADER, RoBERTa, FinBERT, BART
  - Output: Positive / Neutral / Negative (via majority vote)
- Multimodal Feature Extraction:
  - Tools: Whisper (audio), FFmpeg (video), Gemini API
  - Extracted: Mood, Emotion, Pacing, Slogan, Tone, Symbols
- Feature Engineering:
  - LabelEncoder for categorical features
  - StandardScaler for numerical values
  - PCA to 20 components
- Model Training:
  - Models: Logistic Regression, Random Forest, SVM, NB, KNN, MLP, CatBoost
  - Best Model: CatBoost with 87.3% test accuracy
- Validation:
  - GridSearchCV + 3-fold CV
  - EarlyStopping & max_depth to prevent overfitting
  - Feature importance visualization
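
The preprocessing and majority-vote steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: the emoji ranges are approximate, duplicates are matched case-insensitively, and ties fall back to Neutral, which is a choice made here because five voters over three classes can split 2-2-1.

```python
import re
from collections import Counter
from typing import Iterable, List

# Approximate emoji/pictograph ranges (an assumption, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F]"
)


def clean_comment(text: str) -> str:
    """Strip emojis and collapse runs of whitespace."""
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(comments: Iterable[str]) -> List[str]:
    """Clean each comment, then drop empties and case-insensitive repeats."""
    seen = set()
    unique = []
    for comment in comments:
        cleaned = clean_comment(comment)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def majority_vote(labels: List[str], tie_label: str = "Neutral") -> str:
    """Return the most common label; fall back to tie_label on a tie."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]
```

In use, each comment would be scored by every sentiment model, the per-model outputs mapped to Positive / Neutral / Negative, and the resulting list passed to `majority_vote`.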
- Visual Style: Color_Tone, Lighting, Composition, Style_Tag
- Mood/Narrative: Emotional_Tone, Structure, Pacing, Twist
- Semantics: Setting, Product Visibility, Masculine Symbols
- Audio/Text: Slogan, Narration Style, Humor Use
- Audience Profile: Gender, Age, Culture, Lifestyle
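
Categorical attributes such as Color_Tone or Pacing have to become numbers before PCA and model training. The sketch below is a dependency-free illustration of what scikit-learn's LabelEncoder (sorted classes mapped to 0..n-1) and StandardScaler (zero mean, unit population variance) compute. It is not the project's code, the sample values in the usage note are made up, and the PCA step is left out.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct label to an integer, using sorted label order."""
    mapping = {label: i for i, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standardize(values: List[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no signal; return zeros instead of
        # dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

For example, `label_encode(["warm", "cool", "warm", "neutral"])` yields the codes `[2, 0, 2, 1]` because the labels are ordered alphabetically before numbering. The integer codes imply an ordering that the categories do not have, which is worth keeping in mind for distance-based models such as KNN and SVM.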
Task | Status | Action |
---|---|---|
Add 2025 Reddit data | ⏳ | Use reddit_updata.py with year filtering |
Multi-label classification | ✅ | Implement sentiment scoring vector |
Prompt Engineering | ✅ | Refine Gemini instructions for clarity |
Resonance Modeling | ❌ | Design alignment test between ad tone and audience segment |
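
The year filtering mentioned for the 2025 Reddit data could look roughly like this. It is a sketch that assumes each post or comment has been converted to a dict carrying a `created_utc` Unix timestamp (the attribute PRAW exposes on submissions and comments); `filter_by_year` is a hypothetical helper, not a function from reddit_updata.py.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def filter_by_year(items: Iterable[Dict], year: int) -> List[Dict]:
    """Keep items whose created_utc timestamp falls in the given UTC year."""
    return [
        item
        for item in items
        if datetime.fromtimestamp(item["created_utc"], tz=timezone.utc).year == year
    ]
```

Converting with an explicit UTC timezone avoids posts near New Year being assigned to the wrong year on machines set to a local timezone.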
superbowlproject/
├── config/ # API keys, prompt templates
├── database/ # Processed data & backups
├── deliverable-2-appendix/ # Attachments and examples
├── logs/ # Logging outputs
├── models/ # ML models and scripts
├── notebooks/ # Jupyter notebooks for exploration
├── scripts/
├── src/ # Core modules (scraping, processing, analysis)
├── tests/ # Unit & integration tests
├── requirements.txt # Python dependencies
└── README.md # Documentation
- Python 3.8+
- Reddit API credentials
- YouTube Data API key
- Google Cloud (for Gemini)
git clone https://github.com/SiyuSun341/SuperBowlProject.git
cd SuperBowlProject
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # Fill in your API keys
python scripts/run_data_collection.py
python scripts/train_models.py
python scripts/generate_insights.py
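
Before running the scripts, it can help to confirm that the .env file is filled in. The check below is a minimal sketch: the key names in `REQUIRED_KEYS` are hypothetical placeholders (the real names are whatever .env.example defines), and the parser handles only simple `KEY=value` lines.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical key names for illustration; replace them with the
# entries actually listed in .env.example.
REQUIRED_KEYS = [
    "YOUTUBE_API_KEY",
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "GEMINI_API_KEY",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("Missing keys:", missing_keys(env) or "none")
```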
File | Description |
---|---|
docs/api_documentation.md | API usage reference |
docs/model_documentation.md | ML model descriptions |
docs/setup_guide.md | Environment & dependencies |
docs/user_guide.md | Analysis walkthrough |
docs/superbowl_python_setup.md | Super Bowl setup |
- 1,181 total ads (1,205 total videos)
- 95% completeness
- 500K total comments
- CatBoost: 87.3% test accuracy
- PCA-reduced: 20 features, interpretable
- Cross-validation: stable across folds
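
One way to make "stable across folds" concrete is to report the mean and spread of per-fold accuracy. The sketch below shows an unshuffled 3-fold index split and a spread summary using only the standard library. It stands in for, and is much simpler than, the GridSearchCV setup; both function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 3) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Returns one (train_indices, test_indices) pair per fold. The first
    n_samples % k folds receive one extra sample.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def fold_spread(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of per-fold scores."""
    return fmean(scores), pstdev(scores)
```

A small standard deviation relative to the mean suggests the model's accuracy does not depend heavily on which ads land in the held-out fold.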
- Precisely identify 5-7 key success factors for Super Bowl advertisements
- Tailored advertising strategy for the Rogue Ridge personal care product line
- Provide data-driven advertising design recommendations
- Strategic decision support for $11 million advertising investment
- Actionable creative guidance for a 30-second commercial
- Mitigate commercial failure risks
- Help Forge & Field precisely target the intended audience (prototypical American men)
- Comprehensive technical analysis report
- Client-facing advertising design recommendations document
- Pre-launch advertisement testing and optimization framework
- Author: Siyu Sun
- Email: sunsiyu.suzy@gmail.com
- GitHub: SiyuSun341
This repository is part of an academic project at Purdue University (MBT Program). Do not distribute without permission.