A comprehensive tool for exploring and analyzing New York Times science articles through advanced topic modeling, hierarchical clustering, and source analysis. This project uses large language models (LLMs) and machine learning techniques to help readers quickly understand scientific publishing patterns, dive deeper into individual articles, and explore journalistic reporting processes.
- Bird's-eye view of NYT's scientific publishing output
- Custom extension of TopicGPT for topic modeling
- Hierarchical clustering with agglomerative methods
- Interactive tree visualization showing article categorization patterns
- Detailed exploration of individual science articles
- Difference analysis between journalist angles and original scientific sources
- Scientific figure extraction and AI-powered explanations
- Enhanced understanding of scientific content through visual aids
- Analysis of journalistic sources and methodologies
- Source categorization by centrality (High/Medium/Low)
- Perspective analysis and source verification tracking
- Insight into how stories are reported and fact-checked
The main web application providing an interactive interface for exploring the analyzed data.
Key Files:
- `main.py` - Flask application with routes for topic trees, article exploration, and source analysis
- `templates/` - HTML templates for different views:
  - `tree_template_horizontal.html` - Interactive topic hierarchy visualization
  - `dig_deeper_article_viewer.html` - Detailed article analysis interface
  - `explore_reporting_doc_source_data.html` - Source analysis dashboard
  - `index.html` - Main landing page
- `app_data/` - Processed data files for the web application
- `static/` - CSS and JavaScript assets
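As a rough sketch of how the app serves its processed data, a Flask route might look like the following (the route name and payload are illustrative, not the actual API in `main.py`):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical route for illustration; see main.py for the real routes.
@app.route("/topic-tree")
def topic_tree():
    # In the real app this would load processed results from app_data/.
    return jsonify({"root": "Science", "children": ["Health", "Space"]})

# main.py would start the server, e.g. app.run(port=8080)
```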
Core algorithms for organizing and clustering articles by topic.
Key Files:
- `tree_functions.py` - Main hierarchical clustering implementation:
  - Agglomerative clustering using Ward linkage
  - Optimal threshold finding algorithms
  - Tree structure creation and evaluation metrics
- `tree_helper_functions.py` - Tree manipulation utilities:
  - Tree pruning by subtree size
  - Data propagation algorithms
  - Node labeling and summarization
- `agglomerative_clustering.py` - K-means initialization + hierarchical clustering pipeline
- `conversion_functions.py` - Linkage matrix to NetworkX graph conversion
- `prompts.py` - OpenAI prompts for topic labeling and summarization
- `basic_util.py` - OpenAI API utilities
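To illustrate the kind of conversion `conversion_functions.py` performs, here is a minimal sketch (not the project's actual code) of turning a SciPy linkage matrix into a NetworkX tree:

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage

# Toy cluster centers standing in for the real K-means centers.
rng = np.random.default_rng(0)
centers = rng.normal(size=(6, 4))

Z = linkage(centers, method="ward")

def linkage_to_graph(Z, n_leaves):
    """Convert a SciPy linkage matrix into a directed NetworkX tree.

    Row i of Z merges nodes Z[i, 0] and Z[i, 1] into new node n_leaves + i.
    """
    g = nx.DiGraph()
    for i, (a, b, dist, size) in enumerate(Z):
        parent = n_leaves + i
        g.add_edge(parent, int(a), distance=float(dist))
        g.add_edge(parent, int(b), distance=float(dist))
        g.nodes[parent]["size"] = int(size)
    return g

tree = linkage_to_graph(Z, len(centers))
# A full binary merge tree over n leaves has 2n - 1 nodes.
```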
Jupyter notebooks for data processing, analysis, and experimentation.
Key Notebooks:
- `2025-04-25__topic-summary.ipynb` - Topic modeling and hierarchy generation
- `2025-04-25__source-labeling.ipynb` - Source analysis and categorization
- `2025-04-24__data-processing-and-science-article-labeling.ipynb` - Data preprocessing pipeline
- `2025-04-27__diving-deeper-into-articles.ipynb` - Article content analysis
Contains the dataset of NYT science articles and processed outputs.
Key Files:
- `science_articles.json.gz` - Main dataset of science articles
- `full-parsed-source-df.jsonl` - Parsed source data with metadata
- `output_data/` - Processed clustering and analysis results
- `found-science-articles/` - Raw article collection
Visual documentation of the application's user interface and features.
1. Initial Clustering: K-means clustering (128 clusters) on sentence embeddings
2. Hierarchical Organization: Agglomerative clustering with Ward linkage on the cluster centers
3. Tree Optimization: Automatic threshold selection using silhouette analysis and cophenetic correlation
4. Labeling: OpenAI-powered topic labeling and summarization
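The clustering steps above can be sketched with scikit-learn (a simplified stand-in with toy data: the real pipeline uses 128 K-means clusters over `all-MiniLM-L6-v2` embeddings and also considers cophenetic correlation when picking the cut):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Toy stand-ins for sentence embeddings: three well-separated blobs.
rng = np.random.default_rng(42)
embeddings = np.vstack([rng.normal(loc=c, size=(50, 16)) for c in (-2.0, 0.0, 2.0)])

# Step 1: K-means gives coarse cluster centers.
kmeans = KMeans(n_clusters=12, n_init=10, random_state=0).fit(embeddings)
centers = kmeans.cluster_centers_

# Steps 2-3: Ward-linkage agglomerative clustering on the centers,
# keeping the cut with the best silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(centers)
    score = silhouette_score(centers, labels)
    if score > best_score:
        best_k, best_score = k, score
```

With three well-separated blobs, the silhouette criterion recovers a three-way top-level split of the coarse clusters.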
- Centrality Classification: High/Medium/Low importance ranking
- Perspective Analysis: Multiple viewpoint identification
- Verification Tracking: Direct vs. indirect source documentation
- Statistical Aggregation: Source composition analysis
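The statistical aggregation step amounts to counting over per-source labels. A minimal sketch with hypothetical annotations (the field names are illustrative, not the project's actual schema):

```python
from collections import Counter

# Hypothetical per-article source annotations of the kind the
# source-labeling notebook produces.
sources = [
    {"name": "Dr. Lee", "centrality": "High", "verified": "direct"},
    {"name": "NASA press release", "centrality": "Medium", "verified": "indirect"},
    {"name": "Anonymous official", "centrality": "Low", "verified": "indirect"},
    {"name": "Dr. Patel", "centrality": "High", "verified": "direct"},
]

centrality_counts = Counter(s["centrality"] for s in sources)
verification_counts = Counter(s["verified"] for s in sources)

# Share of sources that were directly verified.
direct_share = verification_counts["direct"] / len(sources)
```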
Based on custom differencing algorithms to identify:
- Angle differences between journalist and scientific sources
- Content gaps and additions
- Emphasis and framing variations
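As a toy illustration of content-gap detection (a crude stand-in for the project's custom differencing algorithms, not the actual implementation), one can compare term counts between the journalist's text and the scientific source:

```python
import re
from collections import Counter

def content_diff(journalist_text, source_text, top_n=3):
    """Which terms each document uses that the other lacks."""
    def counts(text):
        return Counter(re.findall(r"[a-z]+", text.lower()))

    j, s = counts(journalist_text), counts(source_text)
    additions = (j - s).most_common(top_n)  # emphasized by the journalist
    gaps = (s - j).most_common(top_n)       # in the source, dropped by the journalist
    return additions, gaps

additions, gaps = content_diff(
    "The miracle drug could change everything for patients",
    "The compound showed modest efficacy in a small phase one trial",
)
```

The real differencing works at the level of angles and framing rather than raw word counts, but the input/output shape is similar: paired texts in, emphasized and omitted content out.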
- Python 3.8+
- Flask
- scikit-learn
- NetworkX
- OpenAI API key
1. Clone the repository:
   ```bash
   git clone https://github.com/your-username/nytimes-science-article-project.git
   cd nytimes-science-article-project
   ```
2. Install dependencies:
   ```bash
   cd app
   pip install -r requirements.txt
   ```
3. Set up your OpenAI API key:
   ```bash
   export OPENAI_API_KEY="your-api-key-here"
   ```
4. Run the application:
   ```bash
   python main.py
   ```
5. Open your browser to `http://localhost:8080`
To regenerate the analysis from scratch:
1. Data Collection: Process the raw NYT science articles
2. Embedding Generation: Create sentence embeddings using `all-MiniLM-L6-v2`
3. Topic Modeling: Run the clustering pipeline in `notebooks/2025-04-25__topic-summary.ipynb`
4. Source Analysis: Execute source labeling in `notebooks/2025-04-25__source-labeling.ipynb`
This project demonstrates applications of several research areas:
- Topic Modeling: Extensions on TopicGPT for hierarchical organization
- Source Analysis: Multi-document source identification and categorization
- Content Analysis: Differencing algorithms for document comparison
- Information Visualization: Interactive exploration of complex datasets
- Human-AI Interaction: LLM-powered content understanding tools
This project is part of ongoing research into LLM-assisted information understanding. Feedback and contributions are welcome!
Research Papers Referenced:
- TopicGPT - Topic modeling framework
- Document Differencing - Content comparison algorithms
- Source Identification - Automatic source detection
- Source Categorization - Source classification methods
Created by Alexander Spangher, PhD student at USC, researching how LLMs can help readers understand large amounts of information quickly.
- Website: alexander-spangher.com
- Twitter: @AlexanderSpangh
- Email: spangher@usc.edu
This project is intended for research and educational purposes.