thecatfix's list / Data Extraction · GitHub
Stars

Data Extraction

33 repositories
Python 2 Updated May 14, 2024

Easy access to IAB Tech Lab taxonomies, including Content, Audience and Ad Product

136 41 Updated May 20, 2025

A demo Jupyter Notebook showcasing a simple local RAG (Retrieval Augmented Generation) pipeline to chat with your PDFs.

Jupyter Notebook 431 168 Updated Jun 17, 2025

wallabag is a self-hostable application for saving web pages: Save and classify articles. Read them later. Freely.

PHP 11,626 811 Updated Jul 2, 2025
HTML 13 36 Updated Jun 20, 2025

DuckDB is an analytical in-process SQL database management system

C++ 30,676 2,411 Updated Jul 2, 2025

SemanticPDF: Drag, Drop, Semantic Search - SemanticPDF is a simple, privacy-focused application that makes it easy to upload a PDF file and perform a semantic search on its contents.

TypeScript 66 9 Updated Apr 4, 2024

💫 Industrial-strength Natural Language Processing (NLP) in Python

Python 31,872 4,530 Updated May 28, 2025

Code I wrote for my AI & LLM workshops

Jupyter Notebook 438 156 Updated Jun 27, 2025

Export/Backup Spotify playlists using the Web API

TypeScript 3,592 462 Updated May 27, 2025

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …

HTML 11,790 973 Updated Jul 1, 2025

🎓 Practical beginner-level introductions to using different tools and technologies, with a focus on their application in the newsroom

Makefile 82 21 Updated Oct 9, 2022

A network filesystem client to connect to SSH servers

C 6,815 520 Updated Mar 11, 2025

Scipy Cookbook

Jupyter Notebook 473 179 Updated May 14, 2024

A time-series database for high-performance real-time analytics packaged as a Postgres extension

C 19,440 951 Updated Jul 2, 2025

Data files (.csv) accessed with nflscrapR and summarized at the player-level

HTML 375 206 Updated Mar 2, 2020

hudi-packages-connectors is a library that provides a toolset to parse and extract relevant information from the personal data sources provided by major websites or social networks.

TypeScript 11 2 Updated Jul 11, 2024

Sobering things about Excel

58 2 Updated Jul 18, 2016

LinkedIn scraper using Selenium WebDriver, headless Chromium, Docker and Scrapy

Python 914 141 Updated Jun 29, 2025

Public resources related to Thinkful's data science bootcamp

Jupyter Notebook 58 129 Updated Mar 22, 2021

Dead simple PDF text reader

TypeScript 40 6 Updated May 8, 2024

Using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.

C++ 145 15 Updated Nov 4, 2023

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

Python 520 112 Updated Mar 3, 2021
Python 103 28 Updated Jan 1, 2025

Swiss-army tool for scraping and extracting data from online assets, made for hackers

Go 3,477 116 Updated Oct 12, 2024

Like jq, but for HTML.

Rust 7,313 122 Updated May 29, 2024

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

TypeScript 41,971 3,952 Updated Jul 3, 2025

ScrapPY is a Python utility for scraping manuals, documents, and other sensitive PDFs to generate wordlists that can be utilized by offensive security tools to perform brute force, forced browsing,…

Python 208 23 Updated May 2, 2025

Parsing HTML at the command line

HTML 8,283 265 Updated May 2, 2024