Stars
This repository includes a benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses".
Arena-Hard-Auto: An automatic LLM benchmark.
This is the repo for the paper "Shepherd: A Critic for Language Model Generation".
Codebase for the ICLR 2024 paper "Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature".
A survey and reflection on the latest research breakthroughs in LLM-generated text detection, covering data, detectors, metrics, current issues, and future directions.
potato: portable text annotation tool
A library for advanced large language model reasoning
[ACL 2023] Reasoning with Language Model Prompting: A Survey
KokoMind: Can LLMs Understand Social Interactions?
Reasoning with Language Model is Planning with World Model
ChatGLM-6B: An Open Bilingual Dialogue Language Model.
Instruct-tune LLaMA on consumer hardware
Source code and data for "The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code" (Findings of ACL 2023).
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-and-effect relationships. It is a QA dataset containing 9,000+ "why" question-answer-rationale triplets.
Repo for "Generating Flashbacks in Stories" (NAACL 2022).
Repo for the NAACL paper "IMHO Fine-Tuning Improves Claim Detection".
Python client for Moss: A System for Detecting Software Similarity
Assessing Humor in Edited News Headlines
EMNLP 2021 - CTC: A Unified Framework for Evaluating Natural Language Generation