View the Live Digest: xmkxabc.github.io/daae/
This project is an automated tool that fetches the latest Computer Science papers from arXiv every day, uses Large Language Models (LLMs) for intelligent analysis, summarization, and translation, and generates easy-to-read daily digests plus structured data for a live website.
- Core Features
- Workflow Overview
- Usage and Customization
- Local Development and Execution
- Key Dependencies
- Project Structure Overview
- Contributing
- License
## Core Features

- Live Digest Website: View the AI-enhanced arXiv papers on the live site: xmkxabc.github.io/daae/.
- Automated Workflow: Achieves a complete daily automated process from data fetching to report generation and website deployment via GitHub Actions.
- arXiv Paper Fetching: Uses a Scrapy crawler to obtain metadata of the latest papers from specified arXiv categories.
- AI Text Enhancement:
  - Leverages the Langchain framework and configurable LLMs (e.g., Google Gemini, DeepSeek) to summarize and translate paper titles and abstracts.
  - Supports multiple output languages (defaults to Chinese for the Markdown reports).
- Diverse Outputs:
  - Generates daily Markdown digests (`data/<YYYY-MM-DD>.md`) containing a list of papers and AI-generated summaries.
  - Dynamically updates the main `README.md` file to display the latest digest, recent reviews, calendar, and historical archives.
  - Builds a JSON database (`docs/database.json`) containing all processed papers, which powers the live digest website deployed via GitHub Pages from the `docs` directory.
- Highly Configurable:
  - Easily configure API keys, target arXiv categories, LLM models, output language, etc., through GitHub Actions Secrets and Variables.
  - Supports fallback LLM models to improve processing success rates.
- Local Execution & Debugging: Provides a `run.sh` script for convenient testing and execution of the entire processing pipeline in a local environment.
- Dependency Management: Uses `uv` for Python package management, ensuring a consistent environment.
## Workflow Overview

- Trigger:
  - Automatically triggered daily at a set time (UTC 16:30) via GitHub Actions.
  - Can be manually triggered from the Actions page.
  - Also triggered on pushes to the `main` branch (for testing convenience).
- Environment Setup:
  - Checks out the code repository.
  - Installs the Python dependencies defined in `pyproject.toml` using `uv sync`.
- Data Processing Pipeline (in the `build` job of `.github/workflows/run.yml`):
  - Step 1: Fetch arXiv Papers:
    - Navigates to the `daily_arxiv/` directory.
    - Runs the Scrapy crawler (`scrapy crawl arxiv`), saving raw paper data as `data/<YYYY-MM-DD>.jsonl`.
  - Step 2: AI Enhancement Processing:
    - Executes the `ai/enhance.py` script, reading the raw JSONL file.
    - Calls the configured LLM (via environment variables such as `GOOGLE_API_KEY`, `MODEL_NAME`, `LANGUAGE`) to summarize and translate papers (a minimal sketch follows this list).
    - Saves the enhanced data as `data/<YYYY-MM-DD>_AI_enhanced_<LANGUAGE>.jsonl`.
  - Step 3: Build JSON Database:
    - Runs the `build_database.py` script.
    - Scans the `data/` directory for all `_AI_enhanced_` files.
    - Merges all paper data into a single `docs/database.json` file for the static website (a merge sketch also follows this list).
  - Step 4: Generate Markdown Digest:
    - Executes the `to_md/convert.py` script.
    - Uses `to_md/paper_template.md` as a template to convert the enhanced JSONL file into the day's Markdown digest (`data/<YYYY-MM-DD>.md`).
  - Step 5: Update Main README.md:
    - Runs the `update_readme.py` script.
    - Reads `readme_content_template.md` as the static framework for the README.
    - Scans all Markdown digests in the `data/` directory to generate dynamic content such as "Latest Digest," "Past 7 Days," "Recent Calendar," and "Full Archive."
    - Populates the template with this dynamic content and overwrites the `README.md` file in the project root.
- (Optional) Create GitHub Issue:
  - (Currently commented out in the workflow) Originally planned to use the `peter-evans/create-issue-from-file` Action to create a new GitHub Issue with the daily digest content.
- Code Commit & Push:
  - Configures Git user information (via the `EMAIL` and `NAME` environment variables).
  - Adds all newly generated or modified files (digests, database, README) to the staging area.
  - If changes exist, commits and pushes them to the `main` branch.
- GitHub Pages Deployment (in the `deploy` job of `.github/workflows/run.yml`):
  - Depends on the successful completion of the `build` job.
  - Configures GitHub Pages.
  - Uploads the `docs/` directory (containing `database.json` and any other static site files) as a Pages artifact.
  - Deploys to GitHub Pages, making the site available at https://xmkxabc.github.io/daae/.
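For concreteness, the AI-enhancement step (Step 2) boils down to reading the day's raw JSONL, prompting the configured LLM once per paper, and writing an enriched JSONL alongside it. Below is a minimal sketch assuming Gemini via `langchain-google-genai`; the field names (`title`, `abstract`, `summary`), the prompt, and the example date are illustrative assumptions rather than the exact logic of `ai/enhance.py`.

```python
import json
import os

from langchain_google_genai import ChatGoogleGenerativeAI

# The model name and API key come from the same environment variables the workflow sets.
llm = ChatGoogleGenerativeAI(model=os.getenv("MODEL_NAME", "gemini-pro"))
language = os.getenv("LANGUAGE", "Chinese")

date = "2025-05-31"  # hypothetical example date
with open(f"data/{date}.jsonl", encoding="utf-8") as src, \
     open(f"data/{date}_AI_enhanced_{language}.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        paper = json.loads(line)
        prompt = (
            f"Summarize the following paper and translate the summary into {language}.\n"
            f"Title: {paper['title']}\nAbstract: {paper['abstract']}"
        )
        # Attach the AI-generated summary while keeping the original metadata.
        paper["summary"] = llm.invoke(prompt).content
        dst.write(json.dumps(paper, ensure_ascii=False) + "\n")
```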
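Step 3's merge is equally small in spirit: gather every `_AI_enhanced_` file under `data/` and concatenate the records into one JSON file for the static site. A minimal sketch follows; the exact shape of `docs/database.json` produced by the real `build_database.py` may differ.

```python
import glob
import json

papers = []
# Collect every AI-enhanced record produced so far.
for path in sorted(glob.glob("data/*_AI_enhanced_*.jsonl")):
    with open(path, encoding="utf-8") as f:
        papers.extend(json.loads(line) for line in f if line.strip())

# Write a single database consumed by the website served from docs/.
with open("docs/database.json", "w", encoding="utf-8") as out:
    json.dump(papers, out, ensure_ascii=False, indent=2)
```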
## Usage and Customization

For users who want to use the default configuration (fetching papers from cs.CV, cs.GR, and cs.CL, using the DeepSeek model, and outputting Chinese Markdown digests; the website simply displays the data as processed, regardless of language):

- Fork this repository.
- Ensure GitHub Actions is enabled for your fork.
- The project will run automatically every day as scheduled.
- Check the live digest at https://[username].github.io/daae/.
- If you like it, please give this project a Star ⭐!
To modify fetch categories, LLMs, language for Markdown reports, etc., follow these steps:
- Fork this repository to your GitHub account.
- Navigate to your repository page and click Settings -> Secrets and variables -> Actions.
- Configure Secrets (for storing sensitive data):
  - `GOOGLE_API_KEY`: Your Google AI API key (e.g., for Gemini models).
  - `SECONDARY_GOOGLE_API_KEY`: (Optional) A backup Google AI API key.
  - (If using OpenAI or other services requiring API keys, add the corresponding Secrets and modify `ai/enhance.py` accordingly.)
- Configure Variables (for storing non-sensitive configuration data):
  - `CATEGORIES`: arXiv categories to fetch, comma-separated, e.g., `"cs.AI,cs.LG,cs.RO"`.
  - `LANGUAGE`: Target language for the Markdown digests, e.g., `"Chinese"` or `"English"`. (The website will display the data as processed.)
  - `MODEL_NAME`: Primary LLM model name (e.g., Google Gemini's `"gemini-pro"` or DeepSeek's `"deepseek-chat"`).
  - `FALLBACK_MODELS`: (Optional) Comma-separated list of fallback LLM models to try sequentially if the primary model fails (see the sketch after this list).
  - `EMAIL`: Email address for GitHub commits (e.g., `your_username@users.noreply.github.com`).
  - `NAME`: Username for GitHub commits (e.g., `your_username`).
- (Optional) Modify the GitHub Actions workflow (`.github/workflows/run.yml`):
  - Adjust the `cron` expression in the `schedule` trigger to change the daily run time.
  - Modify the `ai/enhance.py` script as needed to support different LLM providers or model parameters.
- Test Run:
  - Go to your repository page and click Actions -> Daily Arxiv Digest & Deploy Website.
  - Click Run workflow to trigger a run manually and verify your configuration.
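To illustrate how `MODEL_NAME` and `FALLBACK_MODELS` work together, the sketch below tries each configured model in order until one succeeds. The `summarize_with` helper is hypothetical; the actual retry logic lives inside `ai/enhance.py` and may differ.

```python
import os

def summarize_with(model_name: str, text: str) -> str:
    """Hypothetical helper that calls the given LLM and returns its summary."""
    raise NotImplementedError

primary = os.environ["MODEL_NAME"]
fallbacks = [m.strip() for m in os.getenv("FALLBACK_MODELS", "").split(",") if m.strip()]

def summarize(text: str) -> str:
    last_error = None
    for model in [primary, *fallbacks]:  # primary model first, then each fallback in order
        try:
            return summarize_with(model, text)
        except Exception as err:  # e.g., quota exhausted or model unavailable
            last_error = err
    raise RuntimeError("All configured models failed") from last_error
```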
Important Note on README Modifications:
- The sections from the Chinese "Latest Digest" heading downwards in this README are automatically generated by the `update_readme.py` script. The script uses `readme_content_template.md` as a base template and populates its `content` placeholder with dynamically generated content.
- If you need to modify the static English parts of the README (such as the project introduction or usage instructions), edit the sections of this `README.md` file that precede the dynamically generated Chinese content.
- Please avoid directly editing the dynamically managed content area, as your changes will be overwritten the next time the script runs.
## Local Development and Execution

You can use the `run.sh` script to simulate the main data processing flow of GitHub Actions in your local environment:
```bash
# Ensure uv is installed and your Python environment is set up
# source .venv/bin/activate  (if using a virtual environment)

# Set necessary environment variables (if not handled within the script)
# export LANGUAGE="English"            # For the Markdown report language
# export GOOGLE_API_KEY="your_api_key"
# ... other environment variables required by ai/enhance.py

bash run.sh
```
This script will sequentially execute:

- Scrapy crawler (`scrapy crawl arxiv`, run in the `daily_arxiv/` directory)
- AI enhancement (`ai/enhance.py`)
- Markdown conversion (`to_md/convert.py`) (Note: the target filename in `run.sh` for this step might need the `LANGUAGE` variable adjusted to match the actual output of `ai/enhance.py`)
- README update (`update_readme.py`)
## Key Dependencies

This project primarily relies on the following Python packages (see `pyproject.toml` for details):

- `arxiv`: For interacting with the arXiv API.
- `langchain` & `langchain-google-genai`: For integrating and calling Large Language Models.
- `scrapy`: For building the arXiv crawler.
- `python-dotenv`: For managing environment variables (used mainly in `ai/enhance.py`; see the snippet below).
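For local runs, `python-dotenv` lets the scripts pick up the same variables the GitHub Actions workflow would otherwise provide. A minimal sketch (the defaults shown are assumptions; `ai/enhance.py` may load its configuration differently):

```python
import os

from dotenv import load_dotenv

# Read key/value pairs from a local .env file into the process environment.
load_dotenv()

api_key = os.environ["GOOGLE_API_KEY"]               # required by the AI-enhancement step
model_name = os.getenv("MODEL_NAME", "gemini-pro")   # primary LLM, with an assumed default
language = os.getenv("LANGUAGE", "Chinese")          # output language for Markdown digests
```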
Package management is handled by `uv`:

```bash
# Install/sync dependencies
uv sync
```
## Project Structure Overview

```
.
├── .github/workflows/run.yml   # GitHub Actions workflow definition
├── ai/
│   └── enhance.py               # AI enhancement script (calls LLMs for summary, translation)
├── daily_arxiv/
│   ├── daily_arxiv/
│   │   ├── spiders/
│   │   │   └── arxiv.py         # Scrapy spider: fetches arXiv papers
│   │   └── settings.py          # Scrapy project settings
│   └── scrapy.cfg               # Scrapy project configuration file
├── data/                        # Stores daily raw and AI-enhanced paper data (jsonl, md)
├── docs/                        # Directory for GitHub Pages deployment (contains database.json, index.html, etc.)
├── to_md/
│   ├── convert.py               # Converts AI-enhanced jsonl to Markdown digests
│   └── paper_template.md        # Template for a single paper in Markdown digests
├── .gitignore
├── .python-version              # Specifies Python version (for asdf or pyenv)
├── LICENSE
├── README.md                    # This file
├── build_database.py            # Merges daily AI-enhanced data into docs/database.json
├── pyproject.toml               # Python project configuration (uv/PEP 621)
├── readme_content_template.md   # Base template for dynamic content in README.md
├── run.sh                       # Script for running the main flow locally
├── template.md                  # (Appears to be an old or alternative README template; readme_content_template.md is primarily used)
└── uv.lock                      # uv dependency lock file
```
## Contributing

Contributions to this project are welcome! You can participate by:
- Reporting bugs or suggesting improvements (please open an Issue).
- Submitting Pull Requests to implement new features or fix problems.
- Improving documentation.
## License

This project is licensed under the Apache-2.0 license.
| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | |
| 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 | 30 | 31 | | | |
| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 1 | | | | | | |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 | | | | | | |
2025
May
- 2025-05-31
- 2025-05-30
- 2025-05-29
- 2025-05-28
- 2025-05-27
- 2025-05-26
- 2025-05-25
- 2025-05-24
- 2025-05-23
- 2025-05-22
- 2025-05-21
- 2025-05-20
- 2025-05-19
- 2025-05-18
- 2025-05-17
- 2025-05-16
- 2025-05-15
- 2025-05-14
- 2025-05-13
- 2025-05-12
- 2025-05-11
- 2025-05-10
- 2025-05-09
- 2025-05-08
- 2025-05-07
- 2025-05-06
- 2025-05-05
- 2025-05-04
- 2025-05-03
- 2025-05-02
- 2025-05-01
April
- 2025-04-30
- 2025-04-29
- 2025-04-28
- 2025-04-27
- 2025-04-26
- 2025-04-25
- 2025-04-24
- 2025-04-23
- 2025-04-22
- 2025-04-21
- 2025-04-20
- 2025-04-19
- 2025-04-18
- 2025-04-17
- 2025-04-16
- 2025-04-15
- 2025-04-14
- 2025-04-13
- 2025-04-12
- 2025-04-11
- 2025-04-10
- 2025-04-09
- 2025-04-08
- 2025-04-07
- 2025-04-06
- 2025-04-05
- 2025-04-04
- 2025-04-03
- 2025-04-02
- 2025-04-01
This page is automatically updated by a GitHub Action.