View the Live Digest: xmkxabc.github.io/daae/
This project is an automated tool that fetches the latest Computer Science papers from arXiv every day, uses Large Language Models (LLMs) for intelligent analysis, summarization, and translation, and generates easy-to-read daily digests plus structured data for a live website.
- Core Features
- Workflow Overview
- Usage and Customization
- Local Development and Execution
- Key Dependencies
- Project Structure Overview
- Contributing
- License
## Core Features

- Live Digest Website: View the AI-enhanced arXiv papers on the live site: xmkxabc.github.io/daae/.
- Automated Workflow: Achieves a complete daily automated process from data fetching to report generation and website deployment via GitHub Actions.
- arXiv Paper Fetching: Uses a Scrapy crawler to obtain metadata of the latest papers from specified arXiv categories.
- AI Text Enhancement:
  - Leverages the Langchain framework and configurable LLMs (e.g., Google Gemini, DeepSeek) to summarize and translate paper titles and abstracts.
  - Supports multiple output languages (defaults to Chinese for the Markdown reports).
- Diverse Outputs:
  - Generates daily Markdown digests (`data/<YYYY-MM-DD>.md`) containing a list of papers and AI-generated summaries.
  - Dynamically updates the main `README.md` file to display the latest digest, recent reviews, calendar, and historical archives.
  - Builds a JSON database (`docs/database.json`) containing all processed papers, which powers the live digest website deployed via GitHub Pages from the `docs` directory.
- Highly Configurable:
  - Easily configure API keys, target arXiv categories, LLM models, output language, etc., through GitHub Actions Secrets and Variables.
  - Supports fallback LLM models to improve processing success rates.
- Local Execution & Debugging: Provides a `run.sh` script for convenient testing and execution of the entire processing pipeline in a local environment.
- Dependency Management: Uses `uv` for Python package management, ensuring a consistent environment.
## Workflow Overview

- Trigger:
  - Automatically triggered daily at a set time (UTC 16:30) via GitHub Actions.
  - Can be manually triggered from the Actions page.
  - Also triggered on pushes to the `main` branch (for testing convenience).
- Environment Setup:
  - Checks out the code repository.
  - Installs the Python dependencies defined in `pyproject.toml` using `uv sync`.
- Data Processing Pipeline (in the `build` job of `.github/workflows/run.yml`):
  - Step 1: Fetch arXiv Papers:
    - Navigates to the `daily_arxiv/` directory.
    - Runs the Scrapy crawler (`scrapy crawl arxiv`), saving raw paper data as `data/<YYYY-MM-DD>.jsonl`.
  - Step 2: AI Enhancement Processing:
    - Executes the `ai/enhance.py` script, reading the raw JSONL file.
    - Calls the configured LLM (via environment variables such as `GOOGLE_API_KEY`, `MODEL_NAME`, `LANGUAGE`) to summarize and translate papers (a minimal sketch follows this list).
    - Saves the enhanced data as `data/<YYYY-MM-DD>_AI_enhanced_<LANGUAGE>.jsonl`.
  - Step 3: Build JSON Database:
    - Runs the `build_database.py` script.
    - Scans the `data/` directory for all `_AI_enhanced_` files.
    - Merges all paper data into a single `docs/database.json` file for the static website (a merge sketch also follows this list).
  - Step 4: Generate Markdown Digest:
    - Executes the `to_md/convert.py` script.
    - Uses `to_md/paper_template.md` as a template to convert the enhanced JSONL file into the day's Markdown digest (`data/<YYYY-MM-DD>.md`).
  - Step 5: Update Main README.md:
    - Runs the `update_readme.py` script.
    - Reads `readme_content_template.md` as the static framework for the README.
    - Scans all Markdown digests in the `data/` directory to generate dynamic content such as "Latest Digest," "Past 7 Days," "Recent Calendar," and "Full Archive."
    - Populates the template with this dynamic content and overwrites the `README.md` file in the project root.
- (Optional) Create GitHub Issue:
  - (Currently commented out in the workflow) Originally planned to use the `peter-evans/create-issue-from-file` Action to create a new GitHub Issue with the daily digest content.
- Code Commit & Push:
  - Configures Git user information (via the `EMAIL` and `NAME` environment variables).
  - Adds all newly generated or modified files (digests, database, README) to the staging area.
  - If changes exist, commits and pushes them to the `main` branch.
- GitHub Pages Deployment (in the `deploy` job of `.github/workflows/run.yml`):
  - Depends on the successful completion of the `build` job.
  - Configures GitHub Pages.
  - Uploads the `docs/` directory (containing `database.json` and any other static site files) as a Pages artifact.
  - Deploys to GitHub Pages, making the site available at https://xmkxabc.github.io/daae/.
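For concreteness, the AI-enhancement step (Step 2) boils down to reading the day's raw JSONL, prompting the configured LLM once per paper, and writing an enriched JSONL alongside it. Below is a minimal sketch assuming Gemini via `langchain-google-genai`; the field names (`title`, `abstract`, `summary`), the prompt, and the example date are illustrative assumptions rather than the exact logic of `ai/enhance.py`.

```python
import json
import os

from langchain_google_genai import ChatGoogleGenerativeAI

# The model name and API key come from the same environment variables the workflow sets.
llm = ChatGoogleGenerativeAI(model=os.getenv("MODEL_NAME", "gemini-pro"))
language = os.getenv("LANGUAGE", "Chinese")

date = "2025-05-31"  # hypothetical example date
with open(f"data/{date}.jsonl", encoding="utf-8") as src, \
     open(f"data/{date}_AI_enhanced_{language}.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        paper = json.loads(line)
        prompt = (
            f"Summarize the following paper and translate the summary into {language}.\n"
            f"Title: {paper['title']}\nAbstract: {paper['abstract']}"
        )
        # Attach the AI-generated summary while keeping the original metadata.
        paper["summary"] = llm.invoke(prompt).content
        dst.write(json.dumps(paper, ensure_ascii=False) + "\n")
```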
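Step 3's merge is equally small in spirit: gather every `_AI_enhanced_` file under `data/` and concatenate the records into one JSON file for the static site. A minimal sketch follows; the exact shape of `docs/database.json` produced by the real `build_database.py` may differ.

```python
import glob
import json

papers = []
# Collect every AI-enhanced record produced so far.
for path in sorted(glob.glob("data/*_AI_enhanced_*.jsonl")):
    with open(path, encoding="utf-8") as f:
        papers.extend(json.loads(line) for line in f if line.strip())

# Write a single database consumed by the website served from docs/.
with open("docs/database.json", "w", encoding="utf-8") as out:
    json.dump(papers, out, ensure_ascii=False, indent=2)
```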
## Usage and Customization

For users who want to use the default configuration (fetching papers from cs.CV, cs.GR, and cs.CL, using the DeepSeek model, and outputting Chinese Markdown digests; the website simply displays the data as processed, regardless of language):

- Fork this repository.
- Ensure GitHub Actions is enabled for your fork.
- The project will run automatically every day as scheduled.
- Check the live digest at https://[username].github.io/daae/.
- If you like it, please give this project a Star ⭐!
To modify fetch categories, LLMs, language for Markdown reports, etc., follow these steps:
- Fork this repository to your GitHub account.
- Navigate to your repository page and click Settings -> Secrets and variables -> Actions.
- Configure Secrets (for storing sensitive data):
  - `GOOGLE_API_KEY`: Your Google AI API key (e.g., for Gemini models).
  - `SECONDARY_GOOGLE_API_KEY`: (Optional) A backup Google AI API key.
  - (If using OpenAI or other services requiring API keys, add the corresponding Secrets and modify `ai/enhance.py` accordingly.)
- Configure Variables (for storing non-sensitive configuration data):
  - `CATEGORIES`: arXiv categories to fetch, comma-separated, e.g., `"cs.AI,cs.LG,cs.RO"`.
  - `LANGUAGE`: Target language for the Markdown digests, e.g., `"Chinese"` or `"English"`. (The website will display the data as processed.)
  - `MODEL_NAME`: Primary LLM model name (e.g., Google Gemini's `"gemini-pro"` or DeepSeek's `"deepseek-chat"`).
  - `FALLBACK_MODELS`: (Optional) Comma-separated list of fallback LLM models to try sequentially if the primary model fails (see the sketch after this list).
  - `EMAIL`: Email address for GitHub commits (e.g., `your_username@users.noreply.github.com`).
  - `NAME`: Username for GitHub commits (e.g., `your_username`).
- (Optional) Modify the GitHub Actions workflow (`.github/workflows/run.yml`):
  - Adjust the `cron` expression in the `schedule` trigger to change the daily run time.
  - Modify the `ai/enhance.py` script as needed to support different LLM providers or model parameters.
- Test Run:
  - Go to your repository page and click Actions -> Daily Arxiv Digest & Deploy Website.
  - Click Run workflow to trigger a run manually and verify your configuration.
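To illustrate how `MODEL_NAME` and `FALLBACK_MODELS` work together, the sketch below tries each configured model in order until one succeeds. The `summarize_with` helper is hypothetical; the actual retry logic lives inside `ai/enhance.py` and may differ.

```python
import os

def summarize_with(model_name: str, text: str) -> str:
    """Hypothetical helper that calls the given LLM and returns its summary."""
    raise NotImplementedError

primary = os.environ["MODEL_NAME"]
fallbacks = [m.strip() for m in os.getenv("FALLBACK_MODELS", "").split(",") if m.strip()]

def summarize(text: str) -> str:
    last_error = None
    for model in [primary, *fallbacks]:  # primary model first, then each fallback in order
        try:
            return summarize_with(model, text)
        except Exception as err:  # e.g., quota exhausted or model unavailable
            last_error = err
    raise RuntimeError("All configured models failed") from last_error
```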
Important Note on README Modifications:
- The sections from the Chinese "Latest Digest" heading downwards in this README are automatically generated by the `update_readme.py` script. The script uses `readme_content_template.md` as a base template and populates its `content` placeholder with dynamically generated content.
- If you need to modify the static English parts of the README (such as the project introduction or usage instructions), edit the sections of this `README.md` file that precede the dynamically generated Chinese content.
- Please avoid directly editing the dynamically managed content area, as your changes will be overwritten the next time the script runs.
## Local Development and Execution

You can use the `run.sh` script to simulate the main data processing flow of GitHub Actions in your local environment:
```bash
# Ensure uv is installed and your Python environment is set up
# source .venv/bin/activate  (if using a virtual environment)

# Set necessary environment variables (if not handled within the script)
# export LANGUAGE="English"            # For the Markdown report language
# export GOOGLE_API_KEY="your_api_key"
# ... other environment variables required by ai/enhance.py

bash run.sh
```
This script will sequentially execute:

- Scrapy crawler (`scrapy crawl arxiv`, run in the `daily_arxiv/` directory)
- AI enhancement (`ai/enhance.py`)
- Markdown conversion (`to_md/convert.py`) (Note: the target filename in `run.sh` for this step might need the `LANGUAGE` variable adjusted to match the actual output of `ai/enhance.py`)
- README update (`update_readme.py`)
## Key Dependencies

This project primarily relies on the following Python packages (see `pyproject.toml` for details):

- `arxiv`: For interacting with the arXiv API.
- `langchain` & `langchain-google-genai`: For integrating and calling Large Language Models.
- `scrapy`: For building the arXiv crawler.
- `python-dotenv`: For managing environment variables (used mainly in `ai/enhance.py`; see the snippet below).
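For local runs, `python-dotenv` lets the scripts pick up the same variables the GitHub Actions workflow would otherwise provide. A minimal sketch (the defaults shown are assumptions; `ai/enhance.py` may load its configuration differently):

```python
import os

from dotenv import load_dotenv

# Read key/value pairs from a local .env file into the process environment.
load_dotenv()

api_key = os.environ["GOOGLE_API_KEY"]               # required by the AI-enhancement step
model_name = os.getenv("MODEL_NAME", "gemini-pro")   # primary LLM, with an assumed default
language = os.getenv("LANGUAGE", "Chinese")          # output language for Markdown digests
```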
Package management is handled by `uv`:

```bash
# Install/sync dependencies
uv sync
```
## Project Structure Overview

```
.
├── .github/workflows/run.yml   # GitHub Actions workflow definition
├── ai/
│   └── enhance.py               # AI enhancement script (calls LLMs for summary, translation)
├── daily_arxiv/
│   ├── daily_arxiv/
│   │   ├── spiders/
│   │   │   └── arxiv.py         # Scrapy spider: fetches arXiv papers
│   │   └── settings.py          # Scrapy project settings
│   └── scrapy.cfg               # Scrapy project configuration file
├── data/                        # Stores daily raw and AI-enhanced paper data (jsonl, md)
├── docs/                        # Directory for GitHub Pages deployment (contains database.json, index.html, etc.)
├── to_md/
│   ├── convert.py               # Converts AI-enhanced jsonl to Markdown digests
│   └── paper_template.md        # Template for a single paper in Markdown digests
├── .gitignore
├── .python-version              # Specifies Python version (for asdf or pyenv)
├── LICENSE
├── README.md                    # This file
├── build_database.py            # Merges daily AI-enhanced data into docs/database.json
├── pyproject.toml               # Python project configuration (uv/PEP 621)
├── readme_content_template.md   # Base template for dynamic content in README.md
├── run.sh                       # Script for running the main flow locally
├── template.md                  # (Appears to be an old or alternative README template; readme_content_template.md is primarily used)
└── uv.lock                      # uv dependency lock file
```
## Contributing

Contributions to this project are welcome! You can participate by:
- Reporting bugs or suggesting improvements (please open an Issue).
- Submitting Pull Requests to implement new features or fix problems.
- Improving documentation.
## License

This project is licensed under the Apache-2.0 license.
| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | |
| 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 | 30 | 31 | | | |
| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 1 | | | | | | |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 | | | | | | |
2025
May
- 2025-05-31
- 2025-05-30
- 2025-05-29
- 2025-05-28
- 2025-05-27
- 2025-05-26
- 2025-05-25
- 2025-05-24
- 2025-05-23
- 2025-05-22
- 2025-05-21
- 2025-05-20
- 2025-05-19
- 2025-05-18
- 2025-05-17
- 2025-05-16
- 2025-05-15
- 2025-05-14
- 2025-05-13
- 2025-05-12
- 2025-05-11
- 2025-05-10
- 2025-05-09
- 2025-05-08
- 2025-05-07
- 2025-05-06
- 2025-05-05
- 2025-05-04
- 2025-05-03
- 2025-05-02
- 2025-05-01
April
- 2025-04-30
- 2025-04-29
- 2025-04-28
- 2025-04-27
- 2025-04-26
- 2025-04-25
- 2025-04-24
- 2025-04-23
- 2025-04-22
- 2025-04-21
- 2025-04-20
- 2025-04-19
- 2025-04-18
- 2025-04-17
- 2025-04-16
- 2025-04-15
- 2025-04-14
- 2025-04-13
- 2025-04-12
- 2025-04-11
- 2025-04-10
- 2025-04-09
- 2025-04-08
- 2025-04-07
- 2025-04-06
- 2025-04-05
- 2025-04-04
- 2025-04-03
- 2025-04-02
- 2025-04-01
This page is automatically updated by a GitHub Action.