sctpublic is an open-source project that provides a scalable framework to evaluate clinical reasoning in large language models (LLMs) using Script Concordance Tests (SCTs). In this project, we compare the performance of various state-of-the-art LLMs (including GPT-4o, o1-preview, Claude 3.5 Sonnet, and Gemini-1.5-Pro) against clinician benchmarks on a diverse set of SCT questions.
Script Concordance Testing is a validated medical assessment tool designed to evaluate clinical reasoning under uncertainty. Unlike traditional multiple-choice questions, SCTs measure how new information alters diagnostic and treatment hypotheses—a critical aspect of real-world clinical decision-making.
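As a purely illustrative example (this item is hypothetical and not drawn from the benchmark datasets), an SCT item pairs a short vignette and an initial hypothesis with a new finding, then asks how that finding shifts the hypothesis on a five-point Likert scale:

```python
# A hypothetical SCT item, shown only to illustrate the question format.
sct_item = {
    "vignette": "A 58-year-old man presents to the emergency department with acute chest pain.",
    "hypothesis": "If you were thinking of pulmonary embolism...",
    "new_information": "...and you learn the pain is fully reproducible on palpation,",
    "question": "this hypothesis becomes:",
    # Standard five-point SCT Likert scale
    "scale": {
        -2: "ruled out or almost ruled out",
        -1: "less likely",
        0: "neither more nor less likely",
        +1: "more likely",
        +2: "certain or almost certain",
    },
}
```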
Key highlights of this project:
- Benchmark Composition: 750 SCT questions drawn from diverse international datasets.
- Model Evaluation: Analysis of LLM performance (zero-shot and few-shot, with or without reasoning).
- Human Comparison: Comparisons against performance metrics of medical students, residents, and attending physicians.
This public repository distributes SCT questions exclusively from the Open Medical SCT and Adelaide SCT datasets. These questions are openly available for use and distribution.
Access to the full set of SCT questions, including additional proprietary or sensitive datasets, is not provided here. Please refer to the competition guidelines in our project paper for instructions on how to submit models for evaluation against this final set.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/sctpublic.git
  cd sctpublic
  ```

- Set up a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  If you are using a notebook environment (e.g., Colab), install the additional packages mentioned at the top of the notebooks.
- Environment Variables: Create a `.env` file in the project root with the following keys:

  - `OPENAI_API_KEY`
  - `ANTHROPIC_API_KEY`
  - `GOOGLE_APPLICATION_CREDENTIALS` (should point to your JSON credentials file for Google Vertex AI)
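For reference, the `.env` file uses plain `KEY=value` lines; the values below are placeholders:

```
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-vertex-credentials.json
```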
The project is structured as a combination of Python scripts and Jupyter notebooks:
- Data Processing and Prompt Generation: Check out `modeling.ipynb` or run `modeling.py` to load SCT data, generate prompt templates, and process prompts for each question.
- Model Evaluation: Use the notebooks (e.g., `modeling.ipynb` and `finalizer.ipynb`) to send prompts to your LLM endpoints and record responses; a minimal sketch follows this list.
- Analysis: `dataanalysis.ipynb` provides tools to compute statistics, compare model performances, and generate visualizations; an illustrative scoring example appears further below.
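As a rough illustration of the prompt-and-record loop, the sketch below assumes an OpenAI-style chat endpoint and a hypothetical CSV export of the SCT items (the file name and column names are assumptions); the actual prompt templates and data handling live in `modeling.py` and the notebooks.

```python
# Minimal sketch of the prompt -> response loop, assuming an OpenAI-style chat endpoint.
# "questions.csv" and its columns are hypothetical placeholders for the SCT data export.
import os

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY (and the other keys) from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

questions = pd.read_csv("questions.csv")

responses = []
for _, row in questions.iterrows():
    prompt = (
        f"{row['vignette']}\n"
        f"Hypothesis: {row['hypothesis']}\n"
        f"New information: {row['new_information']}\n"
        "On a scale from -2 (ruled out) to +2 (almost certain), how does the new "
        "information change the hypothesis? Answer with a single integer."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    responses.append(completion.choices[0].message.content)

questions["model_response"] = responses
questions.to_csv("responses_gpt-4o.csv", index=False)
```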
You can also run evaluation scripts from the command line if desired.
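For the analysis step, the exact statistics are implemented in `dataanalysis.ipynb`. As a hedged illustration of how SCT answers are conventionally scored (not necessarily the scoring used in this repository), the standard aggregate method credits each response in proportion to the number of expert panelists who chose it, relative to the modal panel answer:

```python
from collections import Counter

def sct_aggregate_score(answer: int, panel_answers: list[int]) -> float:
    """Standard SCT aggregate scoring: an answer earns the number of panelists who
    chose it divided by the count of the modal (most frequent) panel answer, so the
    modal answer earns 1.0 and an answer no panelist chose earns 0.0."""
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return counts.get(answer, 0) / modal_count

# Example: a panel of 10 experts rated an item -1 (x6), 0 (x3), +1 (x1).
panel = [-1] * 6 + [0] * 3 + [+1]
print(sct_aggregate_score(-1, panel))  # 1.0 (modal answer, full credit)
print(sct_aggregate_score(0, panel))   # 0.5 (partial credit)
print(sct_aggregate_score(+2, panel))  # 0.0 (no panelist chose it)
```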
Contributions are welcome! Feel free to open issues or submit pull requests. When contributing, please follow the coding conventions and ensure your changes are covered by tests where applicable.
This project is licensed under the MIT License. See the LICENSE file for more details.
We extend our gratitude to the research teams and medical experts who have contributed their expertise and data. Special thanks to all the authors and collaborators whose support has enabled this work.