
This tool crawls https://arxiv.org daily and uses LLMs to summarize new papers in the following categories: cs.CR, cs.AI, cs.LG, cs.MA, cs.RO, cs.CV, cs.HC, cs.ET, cs.SE, cs.SI, cs.NI, cs.IT, cs.AR, cs.DC, cs.CY, cs.CE, cs.FL, eess.SY, eess.SP, eess.IV, eess.AS, cs.CL, cs.DS, cs.GR, cs.IR, cs.NE, math.NA, cs.SD, cs.SC.


AI-Enhanced Daily arXiv Digest

๐ŸŒ View the Live Digest: xmkxabc.github.io/daae/

This project is an automated tool that fetches the latest Computer Science papers from arXiv every day, uses Large Language Models (LLMs) to analyze, summarize, and translate them, and generates easy-to-read daily digests plus structured data for a live website.


✨ Core Features

  • Live Digest Website: View the AI-enhanced arXiv papers on the live site: xmkxabc.github.io/daae/.
  • Automated Workflow: A GitHub Actions workflow runs the complete daily pipeline, from data fetching through report generation to website deployment.
  • arXiv Paper Fetching: Uses a Scrapy crawler to obtain metadata of the latest papers from specified arXiv categories.
  • AI Text Enhancement:
    • Leverages the Langchain framework and configurable LLMs (e.g., Google Gemini, DeepSeek) to summarize and translate paper titles and abstracts.
    • Supports multiple output languages (defaults to Chinese for the Markdown reports).
  • Diverse Outputs:
    • Generates daily Markdown digests (data/<YYYY-MM-DD>.md) containing a list of papers and AI-generated summaries.
    • Dynamically updates the main README.md file to display the latest digest, recent reviews, calendar, and historical archives.
    • Builds a JSON database (docs/database.json) containing all processed papers, which powers the live digest website deployed via GitHub Pages from the docs directory.
  • Highly Configurable:
    • Easily configure API keys, target arXiv categories, LLM models, output language, etc., through GitHub Actions Secrets and Variables.
    • Supports fallback LLM models to improve processing success rates (see the sketch after this list).
  • Local Execution & Debugging: Provides a run.sh script for convenient testing and execution of the entire processing pipeline in a local environment.
  • Dependency Management: Uses uv for Python package management, ensuring a consistent environment.
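
The fallback behavior mentioned above is driven by the MODEL_NAME and FALLBACK_MODELS variables. Below is a minimal sketch of how such a fallback loop could be written; it is an illustration rather than the actual ai/enhance.py code, and it assumes Google Gemini models served through langchain-google-genai:

import os
from langchain_google_genai import ChatGoogleGenerativeAI

def enhance_with_fallback(prompt: str) -> str:
    """Try the primary model first, then each fallback model in order."""
    # MODEL_NAME / FALLBACK_MODELS mirror the repository's Actions variables.
    models = [os.environ.get("MODEL_NAME", "gemini-pro")]
    models += [m.strip() for m in os.environ.get("FALLBACK_MODELS", "").split(",") if m.strip()]

    last_error = None
    for model in models:
        try:
            llm = ChatGoogleGenerativeAI(model=model)  # reads GOOGLE_API_KEY from the environment
            return llm.invoke(prompt).content
        except Exception as exc:  # e.g. quota errors or transient API failures
            last_error = exc
    raise RuntimeError(f"All configured models failed: {models}") from last_error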

🚀 Workflow Overview

  1. Trigger:
    • Automatically triggered daily at a set time (UTC 16:30) via GitHub Actions.
    • Can be manually triggered from the Actions page.
    • Also triggered on pushes to the main branch (for testing convenience).
  2. Environment Setup:
    • Checks out the code repository.
    • Installs Python dependencies defined in pyproject.toml using uv sync.
  3. Data Processing Pipeline (in build job of .github/workflows/run.yml):
    • Step 1: Fetch arXiv Papers:
      • Navigates to the daily_arxiv/ directory.
      • Runs the Scrapy crawler (scrapy crawl arxiv), saving raw paper data as data/<YYYY-MM-DD>.jsonl.
    • Step 2: AI Enhancement Processing:
      • Executes the ai/enhance.py script, reading the raw JSONL file.
      • Calls the configured LLM (via environment variables like GOOGLE_API_KEY, MODEL_NAME, LANGUAGE) to summarize and translate papers.
      • Saves the enhanced data as data/<YYYY-MM-DD>_AI_enhanced_<LANGUAGE>.jsonl.
    • Step 3: Build JSON Database:
      • Runs the build_database.py script.
      • Scans the data/ directory for all _AI_enhanced_ files.
      • Merges all paper data into a single docs/database.json file for the static website (see the sketch after this overview).
    • Step 4: Generate Markdown Digest:
      • Executes the to_md/convert.py script.
      • Uses to_md/paper_template.md as a template to convert the enhanced JSONL file into the day's Markdown digest (data/<YYYY-MM-DD>.md).
    • Step 5: Update Main README.md:
      • Runs the update_readme.py script.
      • Reads readme_content_template.md as the static framework for the README.
      • Scans all Markdown digests in the data/ directory to generate dynamic content like "Latest Digest," "Past 7 Days," "Recent Calendar," and "Full Archive."
      • Populates the template with this dynamic content and overwrites the README.md file in the project root.
  4. (Optional) Create GitHub Issue:
    • (Currently commented out in the workflow.) This step was originally intended to use the peter-evans/create-issue-from-file Action to open a GitHub Issue containing the daily digest.
  5. Code Commit & Push:
    • Configures Git user information (via EMAIL and NAME environment variables).
    • Adds all newly generated or modified files (digests, database, README) to the staging area.
    • If changes exist, commits and pushes them to the main branch.
  6. GitHub Pages Deployment (in deploy job of .github/workflows/run.yml):
    • Depends on the successful completion of the build job.
    • Configures GitHub Pages.
    • Uploads the docs/ directory (containing database.json and any other static site files) as a Pages artifact.
    • Deploys to GitHub Pages, making the site available at https://xmkxabc.github.io/daae/.
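
For orientation, the database-building step (Step 3 above) can be pictured as follows. This is a simplified sketch rather than the actual build_database.py, and the record key used for de-duplication is an assumption:

import json
from pathlib import Path

def build_database(data_dir: str = "data", out_file: str = "docs/database.json") -> None:
    """Merge every *_AI_enhanced_* JSONL file into one JSON database for the website."""
    papers = {}
    for path in sorted(Path(data_dir).glob("*_AI_enhanced_*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                papers[record["id"]] = record  # assumed: each record carries its arXiv id
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    Path(out_file).write_text(
        json.dumps(list(papers.values()), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

if __name__ == "__main__":
    build_database()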

🔧 Usage and Customization

Basic Usage

For users who want the default configuration (fetch papers from cs.CV, cs.GR, and cs.CL, use the DeepSeek model, and produce Chinese Markdown digests; the website simply displays whatever language the data was processed in):

  1. Fork this repository.
  2. Ensure GitHub Actions is enabled for your fork.
  3. The project will run automatically daily as scheduled.
  4. Check the live digest at https://[username].github.io/daae/.
  5. If you like it, please give this project a Star ⭐!

Custom Configuration

To modify fetch categories, LLMs, language for Markdown reports, etc., follow these steps:

  1. Fork this repository to your GitHub account.
  2. Navigate to your repository page, click Settings -> Secrets and variables -> Actions.
  3. Configure Secrets (for storing sensitive data):
    • GOOGLE_API_KEY: Your Google AI API key (e.g., for Gemini models).
    • SECONDARY_GOOGLE_API_KEY: (Optional) A backup Google AI API key.
    • (If using OpenAI or other services requiring API keys, add corresponding Secrets and modify ai/enhance.py accordingly.)
  4. Configure Variables (for storing non-sensitive configuration data; a sketch of how the scripts read these appears after this list):
    • CATEGORIES: arXiv categories to fetch, comma-separated, e.g., "cs.AI,cs.LG,cs.RO".
    • LANGUAGE: Target language for Markdown digests, e.g., "Chinese" or "English". (The website will display data as processed).
    • MODEL_NAME: Primary LLM model name (e.g., Google Gemini's "gemini-pro" or DeepSeek's "deepseek-chat").
    • FALLBACK_MODELS: (Optional) Comma-separated list of fallback LLM models to try sequentially if the primary model fails.
    • EMAIL: Email address for GitHub commits (e.g., your_username@users.noreply.github.com).
    • NAME: Username for GitHub commits (e.g., your_username).
  5. (Optional) Modify GitHub Actions Workflow (.github/workflows/run.yml):
    • Adjust the cron expression in the schedule to change the daily run time.
    • Modify the ai/enhance.py script as needed to support different LLM providers or model parameters.
  6. Test Run:
    • Go to your repository page, click Actions -> Daily Arxiv Digest & Deploy Website.
    • Click Run workflow to trigger a run manually and verify your configuration.
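
The workflow exposes the Secrets and Variables above to the pipeline scripts as environment variables. A minimal sketch of how a script such as ai/enhance.py might read them (the defaults shown here are assumptions, not the project's actual defaults):

import os

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]                    # required secret
LANGUAGE = os.environ.get("LANGUAGE", "Chinese")                 # target language for Markdown digests
MODEL_NAME = os.environ.get("MODEL_NAME", "deepseek-chat")       # primary LLM
FALLBACK_MODELS = [m.strip() for m in os.environ.get("FALLBACK_MODELS", "").split(",") if m.strip()]
CATEGORIES = os.environ.get("CATEGORIES", "cs.CV,cs.GR,cs.CL")   # likely consumed by the crawler rather than enhance.py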

Important Note on README Modifications:

  • The sections from "最新速报" (Latest Digest, in Chinese) downwards in this README are automatically generated by the update_readme.py script. The script uses readme_content_template.md as the base template and fills its content placeholder with dynamically generated content (see the sketch below).
  • If you need to modify the static English parts of the README (like the project introduction, usage instructions, etc.), you can directly edit the sections in this README.md file that precede the dynamically generated Chinese content.
  • Please avoid directly editing the dynamically managed content area, as your changes will be overwritten the next time the script runs.
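
For reference, the placeholder substitution described above amounts to something like the sketch below; the "{content}" marker is illustrative, and the real placeholder is whatever readme_content_template.md defines:

from pathlib import Path

def update_readme(dynamic_content: str) -> None:
    """Fill the template's content placeholder and overwrite README.md."""
    template = Path("readme_content_template.md").read_text(encoding="utf-8")
    # "{content}" is an assumed placeholder name; the template defines the real marker.
    Path("README.md").write_text(template.replace("{content}", dynamic_content), encoding="utf-8")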

๐Ÿ› ๏ธ Local Development and Execution

You can use the run.sh script to simulate the main data processing flow of GitHub Actions in your local environment:

# Ensure uv is installed and your Python environment is set up
# source .venv/bin/activate (if using a virtual environment)

# Set necessary environment variables (if not handled within the script)
# export LANGUAGE="English" # For Markdown report language
# export GOOGLE_API_KEY="your_api_key"
# ... other environment variables required by ai/enhance.py

bash run.sh

This script will sequentially execute:

  1. Scrapy crawler (scrapy crawl arxiv, run from the daily_arxiv/ directory)
  2. AI enhancement (ai/enhance.py)
  3. Markdown conversion (to_md/convert.py; sketched after this list). Note: the output filename run.sh expects for this step may need the LANGUAGE variable adjusted to match what ai/enhance.py actually produces.
  4. README update (update_readme.py)
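
The Markdown conversion in step 3 boils down to rendering paper_template.md once per paper. A rough sketch, assuming the template uses Python format-style placeholders such as {title} and {summary} (the actual field names may differ):

import json
from pathlib import Path

def jsonl_to_markdown(jsonl_path: str, template_path: str = "to_md/paper_template.md") -> str:
    """Render a daily Markdown digest from an AI-enhanced JSONL file."""
    template = Path(template_path).read_text(encoding="utf-8")
    sections = []
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            paper = json.loads(line)
            sections.append(template.format_map(paper))  # one rendered block per paper
    return "\n\n".join(sections)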

📦 Key Dependencies

This project primarily relies on the following Python packages (see pyproject.toml for details):

  • arxiv: For interacting with the arXiv API.
  • langchain & langchain-google-genai: For integrating and calling Large Language Models.
  • scrapy: For building the arXiv crawler.
  • python-dotenv: For managing environment variables (used mainly in ai/enhance.py).

Package management is handled by uv:

# Install/sync dependencies
uv sync

📂 Project Structure Overview

.
├── .github/workflows/run.yml    # GitHub Actions workflow definition
├── ai/
│   └── enhance.py               # AI enhancement script (calls LLMs for summary, translation)
├── daily_arxiv/
│   ├── daily_arxiv/
│   │   ├── spiders/
│   │   │   └── arxiv.py         # Scrapy spider: fetches arXiv papers
│   │   └── settings.py          # Scrapy project settings
│   └── scrapy.cfg               # Scrapy project configuration file
├── data/                        # Stores daily raw and AI-enhanced paper data (jsonl, md)
├── docs/                        # Directory for GitHub Pages deployment (contains database.json, index.html, etc.)
├── to_md/
│   ├── convert.py               # Converts AI-enhanced jsonl to Markdown digests
│   └── paper_template.md        # Template for a single paper in Markdown digests
├── .gitignore
├── .python-version              # Specifies Python version (for asdf or pyenv)
├── LICENSE
├── README.md                    # This file
├── build_database.py            # Merges daily AI-enhanced data into docs/database.json
├── pyproject.toml               # Python project configuration (uv/PEP 621)
├── readme_content_template.md   # Base template for dynamic content in README.md
├── run.sh                       # Script for running the main flow locally
├── template.md                  # (Appears to be an old or alternative README template; readme_content_template.md is primarily used)
└── uv.lock                      # uv dependency lock file
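
For orientation, a minimal sketch of what the spider at daily_arxiv/daily_arxiv/spiders/arxiv.py could look like. The selectors, URLs, and CATEGORIES handling below are assumptions for illustration; see the actual file for the real implementation:

import os
import scrapy

class ArxivSpider(scrapy.Spider):
    name = "arxiv"

    def start_requests(self):
        # CATEGORIES mirrors the Actions variable; the default list is an assumption.
        categories = os.environ.get("CATEGORIES", "cs.CV,cs.GR,cs.CL").split(",")
        for cat in categories:
            url = f"https://arxiv.org/list/{cat.strip()}/new"  # daily "new submissions" listing
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Each listing entry (<dt>) links to the paper's /abs/ page.
        for dt in response.css("dl > dt"):
            href = dt.css("a[href*='/abs/']::attr(href)").get()
            if href:
                yield {"id": href.rsplit("/abs/", 1)[-1]}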

๐Ÿค Contributing

Contributions to this project are welcome! You can participate by:

  • Reporting bugs or suggesting improvements (please open an Issue).
  • Submitting Pull Requests to implement new features or fix problems.
  • Improving documentation.

Star History

[Star History chart for xmkxabc/daae]

Status

[arXiv-daily-ai-enhanced workflow status badge]

📜 License

This project is licensed under the Apache-2.0 license.


最新速报 (Latest Digest): 2025-07-03

阅读 2025-07-03 的完整报告... (Read the full 2025-07-03 report...)


本周回顾 (Past 7 Days)


近期日历 (Recent Calendar)

July 2025

| 一 (Mon) | 二 (Tue) | 三 (Wed) | 四 (Thu) | 五 (Fri) | 六 (Sat) | 日 (Sun) |
| --- | --- | --- | --- | --- | --- | --- |
|    | 1  | 2  | 3  | 4  | 5  | 6  |
| 7  | 8  | 9  | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 | 30 | 31 |    |    |    |

June 2025

| 一 (Mon) | 二 (Tue) | 三 (Wed) | 四 (Thu) | 五 (Fri) | 六 (Sat) | 日 (Sun) |
| --- | --- | --- | --- | --- | --- | --- |
|    |    |    |    |    |    | 1  |
| 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| 9  | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 |    |    |    |    |    |    |

ๅކๅฒๅญ˜ๆกฃ (Full Archive)

2025
May
April
March

This page is automatically updated by a GitHub Action.
