A data pipeline for collecting and consolidating social media activity across multiple platforms (Twitter, GitHub, Raindrop.io) into a unified JSONL format.
┌─────────────────┐ ┌──────────────┐ ┌──────────────┐
│ Data Sources │ │ DeltaLoad │ │ Output │
│ ───────────── │ │ ────────── │ │ ────────── │
│ • Twitter │ ──► │ Extract │ │ Unified │
│ • GitHub │ │ Transform │ ──► │ JSONL │
│ • Raindrop.io │ │ Load │ │ Format │
└─────────────────┘ └──────────────┘ └──────────────┘
- Delta loading (only fetches new data since last run; see the sketch after this list)
- Unified data format across platforms
- Error handling and logging
- Rate limit awareness
- Configurable through environment variables
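Delta loading works by persisting a per-source watermark between runs and only requesting items newer than it. The sketch below illustrates the idea; the state file name, the `delta_fetch` helper, and the `created_at` handling are illustrative assumptions rather than the exact logic in `deltaload.py`.

```python
import json
from pathlib import Path

# Hypothetical state file; the real pipeline may track watermarks differently.
STATE_FILE = Path("last_run_state.json")

def load_state() -> dict:
    """Return per-source watermarks, e.g. {"github": "2024-01-20T00:00:00Z"}."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def delta_fetch(source: str, fetch_since) -> list:
    """Fetch only items newer than the stored watermark for `source`."""
    state = load_state()
    since = state.get(source)      # None on the first run means a full fetch
    items = fetch_since(since)     # caller-supplied function that hits the API
    if items:
        # Assumes each item carries an ISO-8601 "created_at" field.
        state[source] = max(item["created_at"] for item in items)
        save_state(state)
    return items
```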
The system uses a unified bookmark format with the following fields:
- `id`: Unique identifier
- `created_at`: Timestamp of creation
- `url`: Source URL
- `source`: Platform identifier (twitter|github|raindrop)
- `content`: Title or excerpt
- `full_text`: Complete content with the following structure:
  - For tweets: Full thread text with metadata
  - For GitHub: Complete README content with repo info
  - For webpages: Parsed article content with metadata
  - For Raindrop: Bookmark data with tags and notes
- Clone the repository
- Create a `.env` file with required tokens:
GH_TOKEN=your_github_token
RAINDROP_TOKEN=your_raindrop_token
TWITTER_USER_ID=your_twitter_user_id
TWITTER_CT0=your_twitter_ct0
TWITTER_AUTH_TOKEN=your_twitter_auth_token
- Install dependencies:
# Python dependencies
pip install -r requirements.txt
# Node dependencies (for TypeScript components)
npm install
Run the main pipeline:
python deltaload.py
Data will be saved to `data-bookmark.jsonl` in the following format:
{
"id": 123,
"created_at": "2024-01-20T00:00:00Z",
"url": "https://example.com",
"source": "twitter|github|raindrop",
"content": "content text",
"metadata": {}
}
- `deltaload.py`: Main ETL pipeline
- `jinaparse.ts`: TypeScript parser for data processing
- `test scrapers/`: Various scraper implementations
- `*_cache/`: Cache directories for different data sources
- Tweet Thread Operations:
- Fetch complete thread context
- Extract thread participants
- Preserve thread hierarchy
- Store media attachments
- Track engagement metrics
- GitHub README Operations (a fetch sketch follows the lists below):
- Fetch repository README files
- Extract repository metadata
- Track stars and forks
- Monitor README updates
- Store commit history
- Web Content Operations:
- Raw HTML archival
- Content extraction with readability
- Metadata parsing (OpenGraph, Schema.org)
- Image and asset preservation
- Link validation
- Enhanced Tweet Capture:
- Use tweet-snap for visual preservation
- Store tweet cards and embeds
- Track tweet edit history
- Preserve thread context
- Archive referenced content
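To make the GitHub README operations above concrete, here is a minimal sketch that pulls a repository's README and basic metadata from the GitHub REST API and shapes it into the unified bookmark format. It assumes the `GH_TOKEN` from the setup section and is not the exact code used by the pipeline.

```python
import os
import requests

def fetch_repo_readme(owner: str, repo: str) -> dict:
    """Fetch README text plus stars/forks for one repository (sketch)."""
    headers = {"Authorization": f"Bearer {os.environ['GH_TOKEN']}"}

    # Repository metadata (description, stars, forks).
    meta = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}", headers=headers
    )
    meta.raise_for_status()
    info = meta.json()

    # Raw README content via the dedicated readme endpoint.
    readme = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/readme",
        headers={**headers, "Accept": "application/vnd.github.v3.raw"},
    )
    readme.raise_for_status()

    return {
        "url": info["html_url"],
        "source": "github",
        "content": info.get("description") or repo,
        "full_text": readme.text,
        "metadata": {"stars": info["stargazers_count"], "forks": info["forks_count"]},
    }
```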
The system follows a monolithic architecture for simplicity and maintainability:
- Core Operations:
- Create: Insert new bookmarks
- Read: Query by source, date, content
- Update: Modify existing entries
- Delete: Remove outdated content
- Merge: Combine duplicate entries
- Advanced Operations:
- Full-text search
- Tag management
- Content deduplication (see the sketch after this list)
- Version tracking
- Backup and restore
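As an example of the deduplication and merge operations listed above, the sketch below collapses JSONL entries that share a URL, keeping the earliest `created_at`. The file names and the `dedupe_bookmarks` helper are assumptions for illustration.

```python
import json

def dedupe_bookmarks(in_path: str = "data-bookmark.jsonl",
                     out_path: str = "data-bookmark.deduped.jsonl") -> int:
    """Keep one entry per URL, preferring the earliest created_at (sketch)."""
    kept: dict = {}
    with open(in_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            entry = json.loads(line)
            key = entry["url"]
            if key not in kept or entry["created_at"] < kept[key]["created_at"]:
                kept[key] = entry
    with open(out_path, "w", encoding="utf-8") as fh:
        for entry in kept.values():
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(kept)
```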
The system supports multiple data formats for flexibility:
- Primary Format (JSONL):
- Line-delimited JSON
- Efficient for streaming
- Easy to parse and generate
- Supports incremental updates
- Export Formats:
- CSV for spreadsheet analysis
- Markdown for documentation
- HTML for web viewing
- PDF for archival
- SQLite for local querying
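As one example of these export formats, the sketch below loads the JSONL file into SQLite using only the standard library; the table and column names are assumptions rather than the project's actual schema.

```python
import json
import sqlite3

def export_to_sqlite(jsonl_path: str = "data-bookmark.jsonl",
                     db_path: str = "bookmarks.db") -> None:
    """Load JSONL bookmarks into a SQLite table for local querying (sketch)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS bookmarks (
            id TEXT PRIMARY KEY,
            created_at TEXT,
            url TEXT,
            source TEXT,
            content TEXT,
            metadata TEXT
        )
    """)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            b = json.loads(line)
            conn.execute(
                "INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?, ?, ?, ?)",
                (str(b.get("id")), b.get("created_at"), b.get("url"),
                 b.get("source"), b.get("content"),
                 json.dumps(b.get("metadata", {}))),
            )
    conn.commit()
    conn.close()
```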
The system runs hourly checks for:
- New tweets and threads
- README updates
- Bookmark changes
- Content updates
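A simple way to schedule these checks is an hourly cron entry; the repository path below is a placeholder for wherever you cloned the project.

```
# Run the pipeline at the top of every hour (placeholder path).
0 * * * * cd /path/to/deltaload && python deltaload.py >> deltaload.log 2>&1
```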
- Database Enhancements:
- Implement SQLite for indexed queries
- Add PostgreSQL option for scaling
- Support distributed operations
- Implement caching layer
- Content Processing:
- Add NLP for content summarization
- Implement topic clustering
- Support content translation
- Add semantic search
- Data Format Extensions:
- Support ActivityPub format
- Add RSS/Atom feeds
- Implement WebMention
- Support IPFS storage
The repository now includes a powerful bookmark enrichment system that can extract content from various sources:
- Twitter Thread Fetching: Extract complete Twitter threads from URLs
- GitHub Repository Scraping: Fetch repository metadata and README content
- General Web Scraping: Extract content from any website
- Content Caching: Cache scraped content to avoid redundant API calls
- JSONL Processing: Read and write bookmark data in JSONL format
Use the `enrich_bookmarks.py` script to enrich bookmark data:
python enrich_bookmarks.py data-bookmark.jsonl --output-file enriched.jsonl
- `input_file`: Input JSONL file with bookmarks
- `--output-file`, `-o`: Output JSONL file (default: input file with .enriched suffix)
- `--cache-dir`, `-c`: Cache directory for enriched content (default: ./cache)
- `--force-refresh`, `-f`: Force refresh cached content
- `--skip-existing`, `-s`: Skip bookmarks that already have content
- `--limit`, `-l`: Limit processing to specified number of bookmarks
- `--filter-source`: Only process bookmarks from a specific source (twitter, github, or web)
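For example, to enrich only GitHub bookmarks, skip entries that already have content, and stop after 50 items:

```
python enrich_bookmarks.py data-bookmark.jsonl -o enriched.jsonl --filter-source github --skip-existing --limit 50
```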
You can also use the system programmatically:
from tools.bookmark_enrichment import BookmarkEnrichment
# Initialize the enrichment system
enrichment = BookmarkEnrichment(cache_dir='./cache')
# Enrich a single bookmark
bookmark = {"url": "https://github.com/username/repo"}
enriched = enrichment.enrich_bookmark(bookmark)
# Enrich multiple bookmarks
bookmarks = [
{"url": "https://github.com/username/repo"},
{"url": "https://twitter.com/username/status/123456789"}
]
enriched_bookmarks = enrichment.enrich_bookmarks(bookmarks)
This project provides tools to fetch Twitter threads and convert them into nicely formatted markdown files with collapsible blocks for better readability.
- Fetch complete Twitter threads using GraphQL API
- Extract media (images and videos) with highest quality
- Convert mentions and hashtags to clickable links
- Generate clean markdown output with collapsible sections
- Save raw data for future reference
- Comprehensive logging for debugging
- Clone the repository:
git clone <repository-url>
cd deltaload
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a `.env` file in the project root with your Twitter credentials:
TWITTER_USER_ID=your_user_id
TWITTER_CT0=your_ct0_cookie
TWITTER_AUTH_TOKEN=your_auth_token
To get these credentials:
- Log in to Twitter in your browser
- Open Developer Tools (F12)
- Go to the Network tab
- Make any request on Twitter
- Look for the following in the request headers:
  - `ct0` cookie value for TWITTER_CT0
  - `auth_token` cookie value for TWITTER_AUTH_TOKEN
- Your user ID can be found in the `twid` cookie (remove the 'u=' prefix)
from twitter.conversation_fetcher import TwitterThreadFetcher
# Initialize the fetcher
fetcher = TwitterThreadFetcher()
# Fetch a thread by its tweet ID
thread = fetcher.get_conversation_thread("TWEET_ID")
# Save to JSONL format
output_file = f"thread_{thread[0]['id']}.jsonl"
fetcher.save_thread_to_jsonl(thread, output_file)
from twitter.markdown_formatter import process_thread
# Convert JSONL to markdown
jsonl_file = "thread_TWEET_ID.jsonl"
process_thread(jsonl_file, output_dir="output")
The markdown output will be saved in the specified output directory with the following structure:
# Twitter Thread
## Main Tweet
[Tweet content with formatted links and mentions]
<details><summary>Metadata</summary>
- Date: [formatted date]
- Tweet ID: [link to tweet]
</details>
## Thread
### Tweet 1
[Tweet content]
**Media:**
[Images/Videos if present]
---
### Tweet 2
[Tweet content]
...
## Raw Data
<details><summary>Click to expand raw data</summary>
[JSON data]
</details>
The scripts use the `loguru` library for logging. Logs are saved to:
- `twitter_thread.log` for the fetcher
- `markdown_formatter.log` for the formatter
Both scripts also output INFO level logs to the console.
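loguru sinks are configured with `logger.add`; the snippet below is a generic sketch of the file-plus-console setup described above, not the exact configuration used in the scripts.

```python
import sys
from loguru import logger

# Drop the default handler, then add a console sink at INFO level
# and a file sink that captures everything for debugging.
logger.remove()
logger.add(sys.stderr, level="INFO")
logger.add("twitter_thread.log", level="DEBUG", rotation="10 MB")

logger.info("Fetching thread ...")
```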
The scripts include comprehensive error handling and logging:
- Network request failures
- API response parsing errors
- File I/O errors
- Invalid credentials
- Missing environment variables
Check the log files for detailed error messages and debugging information.
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.