Stars
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
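A minimal sketch of the pdfplumber workflow this entry describes; the file name "report.pdf" is a placeholder.

```python
import pdfplumber

# Open a PDF and inspect its first page; "report.pdf" is a placeholder path.
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    print(page.extract_text())            # page text reconstructed from the chars
    for table in page.extract_tables():   # each table is a list of rows of cell strings
        print(table)
    # Low-level objects the description mentions: chars, rects, lines
    print(len(page.chars), len(page.rects), len(page.lines))
```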
Awesome-LLM-Robustness: a curated list of papers on uncertainty, reliability, and robustness in large language models
We collect papers about large language models (LLMs) for table-related tasks, e.g., using an LLM for Table QA (a curated collection of "tables + LLM" papers).
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
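A hedged sketch of a typical Unsloth fine-tuning setup; the base model name and LoRA hyperparameters below are illustrative choices, not values taken from this entry.

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model; the model name is an illustrative example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters for parameter-efficient fine-tuning.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```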
Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
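A short sketch of the WebDataset loading pattern these two entries describe; the shard URL pattern is a placeholder.

```python
import webdataset as wds

# The shard pattern is a placeholder; WebDataset streams samples straight out of tar shards.
shards = "data/shard-{000000..000009}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")              # decode image bytes into PIL images
    .to_tuple("jpg", "json")    # pick sample fields by their extension inside the tar
)

for image, metadata in dataset:
    print(image.size, metadata)
    break
```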
We release the UICaption dataset. The dataset consists of UI images (icons and screenshots) and associated text descriptions. This dataset was used to pre-train the Lexi model which provides a gene…
The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
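A hedged sketch of dropping the Sophia-G optimizer into a PyTorch training step; the import path, the SophiaG class name, and the hyperparameter names are assumptions based on the repository, not verified against it.

```python
import torch
from sophia import SophiaG  # assumed import path from the repository

model = torch.nn.Linear(512, 512)          # stand-in model
# Hyperparameter names and values are assumptions for illustration.
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.01, weight_decay=1e-1)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```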
The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and describe the UI elements present on the screen: their type, loca…
A One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K question-answer pairs collected by human annotators for ~35K…
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first dataset and benchmark for developing and evaluating generalist web agents
Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
MiniCPM4: Ultra-Efficient LLMs on End Devices, achieving a 5x+ speedup on typical end-side chips
An open-source framework for training large multimodal models.
The model, data and code for the visual GUI Agent SeeClick
Machine Learning Engineering Open Book
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
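A minimal sketch of the DeepSpeed-MII pipeline API for text generation; the model name is a placeholder and this assumes a recent MII release.

```python
import mii

# The model name is a placeholder; mii.pipeline loads it with DeepSpeed inference kernels.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed-MII makes low-latency inference"], max_new_tokens=64)
print(responses)
```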
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
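A minimal sketch of wrapping a model with deepspeed.initialize; the config dict is a tiny illustrative ZeRO-2/fp16 setup, not a recommended configuration.

```python
import torch
import deepspeed

# Stand-in model; a real run would launch this script with the `deepspeed` launcher.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)

# Illustrative config; real configs usually live in a JSON file.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```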
The dataset includes widget captions that describe UI elements' functionalities. It is used for training and evaluating the widget captioning model (please see the EMNLP'20 paper: https://arxiv…
It includes two datasets used in downstream tasks for evaluating UIBert: App Similar Element Retrieval data and Visual Item Selection (VIS) data. Both datasets are stored as TFRecords.
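A hedged sketch of reading one of these TFRecord files with tf.data; the file name and feature keys are assumptions for illustration, since the entry does not spell out the schema.

```python
import tensorflow as tf

# File name and feature keys are assumptions; the actual UIBert TFRecords define their own schema.
raw_dataset = tf.data.TFRecordDataset(["visual_item_selection.tfrecord"])

feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "caption": tf.io.FixedLenFeature([], tf.string),
}

def parse(serialized_example):
    # Decode one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized_example, feature_spec)

for example in raw_dataset.map(parse).take(1):
    print(example.keys())
```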
Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"
Simple, minimal implementation of the Mamba SSM in one file of PyTorch.
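Since that repo is a single PyTorch file, here is a toy sketch of the discretized state-space recurrence (h_t = Ā·h_{t-1} + B̄·x_t, y_t = C·h_t) that Mamba's selective scan builds on; the shapes and values are made up, and this is not the repo's API.

```python
import torch

# Toy discretized SSM scan: h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t
# Shapes and values are illustrative only.
B, L, D, N = 2, 16, 4, 8                  # batch, sequence length, channels, state size
x = torch.randn(B, L, D)

A_bar = torch.rand(D, N) * 0.9            # per-channel diagonal transition (kept stable)
B_bar = torch.randn(D, N)
C = torch.randn(D, N)

h = torch.zeros(B, D, N)
outputs = []
for t in range(L):
    h = A_bar * h + B_bar * x[:, t, :, None]   # update the hidden state per channel
    outputs.append((h * C).sum(-1))            # project the state back to channel space
y = torch.stack(outputs, dim=1)                # (B, L, D)
print(y.shape)
```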
Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)
OCR Annotations from Amazon Textract for Industry Documents Library