IMPORTANT NOTE: This is currently a Proof of Concept (POC) project intended to demonstrate the feasibility of automated news collection and analysis. It is not intended for production use without further development and hardening.
A full-stack Node.js application for collecting, analyzing, and serving news articles from RSS feeds. This application parses OPML files containing RSS feed URLs, fetches articles, extracts the main content, performs keyword analysis, and serves content through a REST API with a Vue.js frontend.
This project is a Proof of Concept that demonstrates:
- Automated news collection and content extraction
- Implementation of keyword-based content profiling
- Dynamic recommendation system based on user interactions
- Simple web interface for content browsing
While functional, some areas would need further development for production use:
- Enhanced error handling and recovery
- Improved security measures
- Better scalability for larger article volumes
- Production-grade logging and monitoring
- Additional test coverage
Article Content Extraction:
- Uses @extractus/article-extractor to identify and extract the main content from HTML web pages
- Removes boilerplate elements, advertisements, and navigation components
- Preserves the primary text, images, and semantic structure of articles
Keyword Extraction Pipeline:
- Primary Method: OpenAI GPT-4o mini model with a specialized tagging prompt
  - Intelligently identifies 4-7 relevant tags per article
  - Performs linguistic analysis to extract meaningful topics beyond simple frequency counting
- Fallback Method: Google Natural Language API (Entity Analysis)
  - Identifies entities with salience scores
  - Filters by configurable salience threshold (default 0.01)
  - Limits to configurable maximum keywords per article
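The fallback filtering described above (salience threshold, then a cap on keyword count) can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: `selectKeywords` is a hypothetical helper, the input mimics `{ name, salience }` pairs from entity analysis, the `maxKeywords` default of 10 is invented for the example, and lowercasing plus deduplication are extra steps the README does not mention.

```javascript
// Hypothetical helper: keeps entities at or above the salience threshold,
// orders them by salience, normalizes names, and caps the result.
// The default of 10 keywords is an assumption for illustration only.
function selectKeywords(entities, { salienceThreshold = 0.01, maxKeywords = 10 } = {}) {
  const seen = new Set();
  return entities
    .filter((entity) => entity.salience >= salienceThreshold)
    .sort((a, b) => b.salience - a.salience)
    .map((entity) => entity.name.trim().toLowerCase())
    .filter((name) => {
      // Drop empty names and duplicates that differ only by case.
      if (name === '' || seen.has(name)) return false;
      seen.add(name);
      return true;
    })
    .slice(0, maxKeywords);
}

const sample = [
  { name: 'SQLite', salience: 0.42 },
  { name: 'footer', salience: 0.004 }, // below the 0.01 threshold
  { name: 'Node.js', salience: 0.31 },
  { name: 'sqlite', salience: 0.05 }, // duplicate after lowercasing
];

console.log(selectKeywords(sample, { maxKeywords: 2 }));
// -> [ 'sqlite', 'node.js' ]
```

Sorting before truncating means the cap always discards the least salient entities first.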
Advanced Content Organization:
- Keyword-based article organization without complex topic modeling
- Efficient content categorization through feed categories and keyword metadata
- Content similarity analysis based on keyword overlap with normalized scoring
- Dynamic content relationships through user interaction patterns
User Profile Generation:
- Builds keyword profiles based on user interaction history
- Implements a time-decay algorithm with 30-day half-life for recency bias
- Weighs interactions by type (thumbs_up: 5.0, click: 1.0, thumbs_down: -3.0)
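To make the profile-building step concrete, here is a minimal sketch of a half-life decay combined with the interaction weights listed above. The function names and the interaction record shape (`type`, `timestamp`, `keywords`) are assumptions made for the example; the project's real decay formula and data model may differ.

```javascript
// Interaction weights as listed in the README.
const INTERACTION_WEIGHTS = { thumbs_up: 5.0, click: 1.0, thumbs_down: -3.0 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Half-life decay: the factor halves every `halfLifeDays` days.
function decayFactor(ageDays, halfLifeDays = 30) {
  return Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// Assumed record shape: { type, timestamp (ms since epoch), keywords: string[] }.
function buildKeywordProfile(interactions, now = Date.now(), halfLifeDays = 30) {
  const profile = new Map();
  for (const { type, timestamp, keywords } of interactions) {
    const base = INTERACTION_WEIGHTS[type];
    if (base === undefined) continue; // skip unknown interaction types
    const weight = base * decayFactor((now - timestamp) / MS_PER_DAY, halfLifeDays);
    for (const keyword of keywords) {
      profile.set(keyword, (profile.get(keyword) || 0) + weight);
    }
  }
  return profile;
}

const now = Date.UTC(2024, 0, 31);
const profile = buildKeywordProfile(
  [
    { type: 'thumbs_up', timestamp: now - 30 * MS_PER_DAY, keywords: ['sqlite'] },
    { type: 'click', timestamp: now, keywords: ['sqlite', 'vue'] },
  ],
  now
);
console.log(profile.get('sqlite')); // 3.5 (5.0 halved after 30 days, plus 1.0)
console.log(profile.get('vue')); // 1
```

A thumbs-up from 30 days ago therefore counts half as much as one given today, while a thumbs-down contributes a negative weight that decays at the same rate.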
Multi-factor Scoring Algorithm:
- Keyword matching (40%): Compares article keywords against user profile keywords
- Category preference (20%): Based on historical category interactions
- Source preference (20%): Based on historical source interactions
- Recency score (20%): Exponential decay based on article age
- Direct interaction boost: Additional score for previously interacted articles
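The weighted combination above can be illustrated with a short sketch. It assumes each signal has already been normalized to the range 0 to 1, that the direct interaction boost is simply added on top, and that recency uses a 7-day half-life; none of these details are stated in this README, and all function names are hypothetical.

```javascript
// Default factor weights as listed in the README.
const DEFAULT_WEIGHTS = { keyword: 0.4, category: 0.2, source: 0.2, recency: 0.2 };

// Exponential decay on article age. The 7-day half-life is an assumption.
function recencyScore(ageDays, halfLifeDays = 7) {
  return Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// `signals` holds per-article values assumed to be normalized to [0, 1].
function scoreArticle(signals, weights = DEFAULT_WEIGHTS, interactionBoost = 0) {
  return (
    weights.keyword * signals.keywordMatch +
    weights.category * signals.categoryPreference +
    weights.source * signals.sourcePreference +
    weights.recency * signals.recency +
    interactionBoost
  );
}

const score = scoreArticle({
  keywordMatch: 0.5,
  categoryPreference: 1,
  sourcePreference: 0,
  recency: recencyScore(7), // one half-life old, so 0.5
});
console.log(score.toFixed(2)); // "0.50"
```

Because the four weights sum to 1.0, the base score stays within 0 to 1 when the inputs do, which keeps scores comparable across articles before any boost is applied.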
Content Similarity Engine:
- Identifies similar articles through keyword overlap analysis
- Calculates similarity score using a normalized vector comparison:
similarityScore = matchingKeywords.length / √(sourceKeywords.length × targetKeywords.length)
- Combines similarity with user interaction for final ranking
- Filters to ensure minimum relevance threshold
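The similarity formula above translates fairly directly into code. The sketch below is one possible reading: it deduplicates keywords with a `Set` before counting (an assumption), and the `findSimilar` helper with its `minScore` default of 0.2 is hypothetical, standing in for the unspecified minimum relevance threshold.

```javascript
// similarityScore = matching / sqrt(|source| * |target|), on deduplicated keywords.
function keywordSimilarity(sourceKeywords, targetKeywords) {
  const source = new Set(sourceKeywords);
  const target = new Set(targetKeywords);
  if (source.size === 0 || target.size === 0) return 0; // avoid dividing by zero
  let matching = 0;
  for (const keyword of source) {
    if (target.has(keyword)) matching += 1;
  }
  return matching / Math.sqrt(source.size * target.size);
}

// Hypothetical ranking helper; the 0.2 threshold is an illustrative value.
function findSimilar(article, candidates, minScore = 0.2) {
  return candidates
    .filter((candidate) => candidate.id !== article.id)
    .map((candidate) => ({
      id: candidate.id,
      score: keywordSimilarity(article.keywords, candidate.keywords),
    }))
    .filter((result) => result.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Two shared keywords out of 3 and 4: 2 / sqrt(3 * 4), roughly 0.577.
console.log(keywordSimilarity(['vue', 'sqlite', 'rss'], ['sqlite', 'rss', 'opml', 'nlp']));
```

Dividing by the geometric mean of the two keyword counts keeps the score between 0 and 1 and prevents articles with long keyword lists from dominating purely by size.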
- Node.js: JavaScript runtime environment
- Express: Web server framework
- Better-SQLite3: SQLite database driver with performance optimizations
- RSS Parser: For fetching and parsing RSS feeds
- Natural: NLP library for text processing and tokenization
- Stopword: For removing common stopwords from text
- OpenAI API: For advanced keyword extraction using GPT-4o mini
- Google Cloud Natural Language API: For entity analysis and backup keyword extraction
- Axios: Promise-based HTTP client for fetching article content
- dotenv: Environment variable management
- xml2js: XML parsing for OPML files
- Turndown: HTML to Markdown conversion
- Vue.js 3: Frontend framework with Composition API
- Vite: Build tool and development server
- TailwindCSS: Utility-first CSS framework for styling
- PostCSS/Autoprefixer: CSS processing and browser compatibility
- OPML Parsing: Parse OPML files to extract RSS feed information
- RSS Feed Fetching: Fetch articles from RSS feeds
- Content Extraction: Extract the main content from article URLs using @extractus/article-extractor
- Keyword Extraction: Extract relevant keywords from article content
- REST API: Serve articles and recommendations through a REST API
- Vue.js Frontend: Browse and interact with articles through a modern web interface
- Admin Dashboard: Monitor system statistics and user preference profiles
- SQLite Database: Efficient storage and querying of articles and interaction data
- Dynamic Recommendations: Personalized article recommendations based on keyword matching and user interactions
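As an illustration of the OPML parsing feature listed above, the sketch below walks an OPML document that has already been parsed by xml2js with its default options (child elements as arrays, attributes under `$`). The helper names and the returned feed shape are assumptions, as is the choice to use the enclosing outline's `text` as the category; the project's `opmlParser.js` may work differently.

```javascript
// Recursively collects feeds from xml2js-style outline nodes.
// Any outline with an xmlUrl attribute is treated as a feed; an enclosing
// outline's text (or title) is used as the category. Both rules are assumptions.
function collectFeeds(outlines = [], category = null, feeds = []) {
  for (const outline of outlines) {
    const attrs = outline.$ || {};
    if (attrs.xmlUrl) {
      feeds.push({
        title: attrs.title || attrs.text || attrs.xmlUrl,
        url: attrs.xmlUrl,
        category,
      });
    }
    if (Array.isArray(outline.outline)) {
      collectFeeds(outline.outline, attrs.text || attrs.title || category, feeds);
    }
  }
  return feeds;
}

// `parsed` is the object produced by xml2js, for example:
//   const parsed = await require('xml2js').parseStringPromise(opmlText);
function feedsFromParsedOpml(parsed) {
  return collectFeeds(parsed?.opml?.body?.[0]?.outline ?? []);
}

// Hand-built stand-in for xml2js output so the sketch runs without dependencies.
const parsedExample = {
  opml: {
    body: [
      {
        outline: [
          {
            $: { text: 'Science' },
            outline: [
              { $: { type: 'rss', text: 'Example Feed', xmlUrl: 'https://example.com/rss' } },
            ],
          },
        ],
      },
    ],
  },
};

console.log(feedsFromParsedOpml(parsedExample));
// -> [ { title: 'Example Feed', url: 'https://example.com/rss', category: 'Science' } ]
```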
NewsFeedSolo/
├── src/ # Backend source code
│ ├── index.js # Main collector entry point
│ ├── server.js # API server
│ ├── database.js # Database operations
│ ├── opmlParser.js # OPML file parsing
│ ├── rssFetcher.js # RSS feed fetching
│ ├── contentFetcher.js # Content fetching
│ ├── articleExtractor.js # Article extraction
│ ├── keywordExtractor.js # Keyword extraction
│ └── storage.js # File storage management
├── frontend/ # Vue.js frontend application
│ ├── src/ # Frontend source code
│ │ ├── App.vue # Main application component
│ │ ├── Admin.vue # Admin dashboard
│ │ └── main.js # Frontend entry point
│ └── index.html # Frontend HTML template
├── opml/ # OPML feed configuration
│ ├── development.opml # Development news feeds
│ ├── diy.opml # DIY project feeds
│ ├── science.opml # Science news feeds
│ ├── retro.opml # Retro computing/gaming feeds
│ └── us_politics.opml # US politics news feeds
├── storage/ # Article storage
│ └── news.db # SQLite database
└── utils/ # Utility scripts
  └── migrate.js # Database migration tools
The application uses an SQLite database (storage/news.db) for:
- Article metadata and content
- User interactions
- Article keywords
- Recommendation data
The application features a keyword-based recommendation system:
- User Preference Profiles: Automatically built based on user interactions
- Time-Decay Algorithm: Recent interactions have higher weight (30-day half-life)
- Multifactor Scoring: Combines keyword matching, category preferences, source preferences, and recency
- Content Similarity: Finds similar articles based on keyword overlap
- Configurable Weights: All recommendation weights can be adjusted via environment variables
For a detailed explanation of all recommendation system parameters, scoring algorithms, and tuning guidance, see RECOMMENDATION_SYSTEM.md.
Articles are scored based on multiple factors:
- Keyword match (40%): How well article keywords match user preferences
- Category preference (20%): User's historical interaction with content categories
- Source preference (20%): User's historical interaction with content sources
- Recency (20%): Newer content receives a boost
- Direct interaction: Additional boost for articles directly interacted with
All recommendation scoring weights are configurable through environment variables:
Factor Weights:
- KEYWORD_MATCH_WEIGHT=0.4: Weight for keyword matching
- CATEGORY_WEIGHT=0.2: Weight for category preference
- SOURCE_WEIGHT=0.2: Weight for source preference
- RECENCY_WEIGHT=0.2: Weight for recency score
Interaction Weights:
- THUMBS_UP_WEIGHT=5.0: Weight for "thumbs up" interactions
- THUMBS_DOWN_WEIGHT=-3.0: Weight for "thumbs down" interactions
- CLICK_WEIGHT=1.0: Weight for article click interactions
Time Decay:
- INTERACTION_DECAY_DAYS=30: Half-life period (in days) for the interaction decay algorithm
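One way to read these variables with safe defaults is sketched below. The variable names and default values come from the list above; `numberFromEnv`, `loadRecommendationConfig`, and the shape of the returned object are hypothetical, and the project may load its configuration differently (for example directly after calling dotenv).

```javascript
// Parses a numeric environment variable, falling back when it is unset,
// blank, or not a finite number.
function numberFromEnv(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || String(raw).trim() === '') return fallback;
  const value = Number(raw);
  return Number.isFinite(value) ? value : fallback;
}

// Hypothetical loader; pass a plain object instead of process.env when testing.
function loadRecommendationConfig(env = process.env) {
  return {
    factorWeights: {
      keywordMatch: numberFromEnv(env, 'KEYWORD_MATCH_WEIGHT', 0.4),
      category: numberFromEnv(env, 'CATEGORY_WEIGHT', 0.2),
      source: numberFromEnv(env, 'SOURCE_WEIGHT', 0.2),
      recency: numberFromEnv(env, 'RECENCY_WEIGHT', 0.2),
    },
    interactionWeights: {
      thumbs_up: numberFromEnv(env, 'THUMBS_UP_WEIGHT', 5.0),
      thumbs_down: numberFromEnv(env, 'THUMBS_DOWN_WEIGHT', -3.0),
      click: numberFromEnv(env, 'CLICK_WEIGHT', 1.0),
    },
    interactionDecayDays: numberFromEnv(env, 'INTERACTION_DECAY_DAYS', 30),
  };
}

const config = loadRecommendationConfig({ THUMBS_DOWN_WEIGHT: '-1.5', RECENCY_WEIGHT: 'abc' });
console.log(config.interactionWeights.thumbs_down); // -1.5
console.log(config.factorWeights.recency); // 0.2 (invalid value falls back to the default)
```

Falling back on malformed values keeps a typo in `.env` from silently turning a weight into `NaN` and corrupting every score.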
- GET /api/articles: Get articles with optional filtering
- GET /api/categories: Get available article categories
- GET /api/recommendations: Get personalized article recommendations
- GET /api/articles/:id/similar: Get articles similar to a specific article
- GET /api/profile: Get user preference profile
- POST /api/articles/:id/interaction: Track user interactions
- GET /api/admin/stats: Get system statistics
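A client might call these endpoints along the following lines. The paths come from the list above and the base URL from the setup instructions, but the `category` query parameter and the `{ type }` request body are guesses made for illustration; check `src/server.js` for the parameters the API actually accepts.

```javascript
const API_BASE = 'http://localhost:3000';

// Builds an absolute API URL with optional query parameters.
function apiUrl(path, params = {}) {
  const url = new URL(path, API_BASE);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Describes the POST request for tracking an interaction.
// The JSON body shape ({ type }) is an assumption, not a documented contract.
function interactionRequest(articleId, type) {
  return {
    url: apiUrl(`/api/articles/${encodeURIComponent(articleId)}/interaction`),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ type }),
    },
  };
}

console.log(apiUrl('/api/articles', { category: 'science' }));
// -> http://localhost:3000/api/articles?category=science

// With a fetch implementation available (browsers, or Node.js 18 and later):
//   const { url, options } = interactionRequest(42, 'thumbs_up');
//   const response = await fetch(url, options);
```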
- Node.js 16.x or higher
- npm or yarn
- SQLite 3
- Clone the repository:
  git clone https://github.com/yourusername/newsfeedsolo.git
  cd newsfeedsolo
- Install backend dependencies:
  npm install
- Install frontend dependencies:
  cd frontend
  npm install
- Start the backend services:
  # Start the collector and API server
  npm start
- Start the frontend development server:
  cd frontend
  npm run dev
- Access the application:
  - Frontend: http://localhost:5173
  - API: http://localhost:3000
Configure the application through environment variables or modify the source files:
- Copy the example environment file to create your own:
  cp .env.example .env
- Edit the .env file with your specific configuration:
  - API keys for OpenAI and Google Cloud
  - Database configuration
  - Content extraction settings
  - Collection and recommendation parameters
Other configuration files:
- src/index.js: Collector settings
- src/server.js: API server configuration
- src/database.js: Recommendation system parameters
- frontend/vite.config.js: Frontend build configuration
MIT