📊 PBS (Penalized Brier Score) and PLL (Penalized Logarithmic Loss) are strictly proper scoring rules for evaluating probabilistic classifiers. They fix a flaw shared by the Brier Score (MSE) and Log Loss (Cross-Entropy), both of which can rank an overconfident, incorrect prediction ahead of a calibrated, correct one, and are therefore better suited for model selection, early stopping, and checkpointing.
In many high-stakes applications, confidence calibration is critical. Traditional accuracy-based metrics (Accuracy, F1) ignore prediction confidence. Consider:
- Cancer Diagnosis: Differentiating 51% vs. 99% confidence in malignancy
- ICU Triage: Overconfident mispredictions risk patient safety
- Autonomous Vehicles: Handling uncertainties about obstacles
- Financial Risk Modeling: Pricing and investment decisions
- Security Threat Detection: High-confidence false negatives
Accuracy or F1 score alone cannot capture this nuance.
While the Brier Score (also known as Mean Squared Error, MSE, or Quadratic Score) and Log Loss (Cross-Entropy, Negative Log-Likelihood, NLL, or Logarithmic Score) are strictly proper scoring rules, they can still favor incorrect, overconfident predictions over calibrated, correct ones. Consider the following three-class example:
| Case | True Class | Prediction | Brier Score | Log Loss | Notes |
|---|---|---|---|---|---|
| A | [0, 1, 0] | [0.33, 0.34, 0.33] | 0.6534 | 0.4685 | ✅ Correct, but low confidence |
| B | [0, 1, 0] | [0.51, 0.49, 0.00] | 0.5202 | 0.3098 | ❌ Incorrect, but "better" score |
Traditional scores prefer B over A, violating the principle that correct predictions should always be rewarded.
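The table's numbers are easy to reproduce; a minimal check, assuming the sum-form Brier score and a base-10 logarithm for Log Loss (the conventions the values above imply):

```python
import numpy as np

y  = np.array([0, 1, 0])           # true class: index 1
qa = np.array([0.33, 0.34, 0.33])  # Case A: correct, low confidence
qb = np.array([0.51, 0.49, 0.00])  # Case B: incorrect, overconfident

for name, q in [("A", qa), ("B", qb)]:
    brier = np.sum((y - q) ** 2)   # sum-form Brier score
    log_loss = -np.log10(q[1])     # log loss of the true class
    print(name, round(brier, 4), round(log_loss, 4))
# A 0.6534 0.4685
# B 0.5202 0.3098  <- lower (better) on both, despite being wrong
```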
We introduce a penalty term that ensures any incorrect prediction is scored worse than any correct one.
Let $\mathbf{y}$ be the one-hot true vector with true class index $t$, $\mathbf{q}$ the predicted probability vector, and $c$ the number of classes. Define the set of classes that outscore the true class:

$$S = \{\, j \in \{1, \dots, c\} : q_j > q_t \,\}$$

so a prediction is top-1 correct exactly when $S = \emptyset$. Then the Penalized Brier Score (PBS) is:

$$\mathrm{PBS}(\mathbf{y}, \mathbf{q}) = \frac{1}{c} \sum_{j=1}^{c} (y_j - q_j)^2 + \frac{c-1}{c}\,\mathbb{1}[S \neq \emptyset]$$

And the Penalized Logarithmic Loss (PLL) is:

$$\mathrm{PLL}(\mathbf{y}, \mathbf{q}) = -\sum_{j=1}^{c} y_j \log q_j + \log(c)\,\mathbb{1}[S \neq \emptyset]$$

The fixed penalties $\frac{c-1}{c}$ and $\log c$ (matching the implementations below) ensure that every incorrect prediction scores worse than every correct one.
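Applied to Cases A and B above (with $c = 3$; using the mean-form Brier score and natural logarithm, as in the implementations below), the penalties restore the intended ordering:

- Case A (correct, $S = \emptyset$): $\mathrm{PBS} = 0.6534 / 3 \approx 0.2178$, $\mathrm{PLL} = -\ln(0.34) \approx 1.0788$
- Case B (incorrect, $S \neq \emptyset$): $\mathrm{PBS} = 0.5202 / 3 + 2/3 \approx 0.8401$, $\mathrm{PLL} = -\ln(0.49) + \ln 3 \approx 1.8120$

A now scores strictly better (lower) than B on both metrics.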
Penalized Brier Score (PBS)
```python
import tensorflow as tf

def pbs(y, q):
    """
    Computes the Penalized Brier Score.

    Args:
        y: Ground truth (one-hot encoded), shape [batch_size, num_classes]
        q: Predicted probabilities, shape [batch_size, num_classes]
    Returns:
        Mean PBS across the batch
    """
    y = tf.cast(y, tf.float32)
    c = y.get_shape()[1]
    # Payoff term: count the classes whose probability exceeds the true class's
    ST = tf.math.subtract(q, tf.reduce_sum(tf.where(y == 1, q, y), axis=1)[:, None])
    ST = tf.where(ST < 0, tf.constant(0, dtype=tf.float32), ST)
    payoff = tf.reduce_sum(tf.math.ceil(ST), axis=1)
    # Replace any nonzero count with the fixed penalty M = (c - 1) / c
    M = (c - 1) / c
    payoff = tf.where(payoff > 0, tf.constant(M, dtype=tf.float32), payoff)
    # Brier score (mean squared error per sample) + penalty
    brier = tf.math.reduce_mean(tf.math.square(tf.math.subtract(y, q)), axis=1)
    return tf.math.reduce_mean(brier + payoff)
```
Penalized Logarithmic Loss (PLL)
```python
import math

import tensorflow as tf

def pll(y, q):
    """
    Computes the Penalized Logarithmic Loss.

    Args:
        y: Ground truth (one-hot encoded), shape [batch_size, num_classes]
        q: Predicted probabilities, shape [batch_size, num_classes]
    Returns:
        Mean PLL across the batch
    """
    y = tf.cast(y, tf.float32)
    c = y.get_shape()[1]
    # Payoff term: count the classes whose probability exceeds the true class's
    ST = tf.math.subtract(q, tf.reduce_sum(tf.where(y == 1, q, y), axis=1)[:, None])
    ST = tf.where(ST < 0, tf.constant(0, dtype=tf.float32), ST)
    payoff = tf.reduce_sum(tf.math.ceil(ST), axis=1)
    # Replace any nonzero count with M = log(1/c); subtracting it adds log(c)
    M = math.log(1 / c)
    payoff = tf.where(payoff > 0, tf.constant(M, dtype=tf.float32), payoff)
    log_loss = tf.keras.losses.categorical_crossentropy(y, q)
    ce_loss = tf.cast(log_loss, tf.float32)
    # Cross-entropy minus the (negative) payoff, i.e. plus log(c) when incorrect
    return tf.math.reduce_mean(ce_loss - payoff)
```
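A quick sanity check on Cases A and B from the motivating example, using the two functions above:

```python
y_ab = tf.constant([[0., 1., 0.]])
q_a  = tf.constant([[0.33, 0.34, 0.33]])  # correct, low confidence
q_b  = tf.constant([[0.51, 0.49, 0.00]])  # incorrect, overconfident

print(pbs(y_ab, q_a).numpy(), pbs(y_ab, q_b).numpy())  # ~0.2178 vs ~0.8401
print(pll(y_ab, q_a).numpy(), pll(y_ab, q_b).numpy())  # ~1.0788 vs ~1.8120
```

Unlike the raw Brier score and log loss, both metrics now rank the correct Case A ahead of the incorrect Case B.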
Install via PyPI:

```bash
pip install superior-scoring-rules
```
```python
import tensorflow as tf
from superior_scoring_rules import pbs, pll

# Sample data (batch_size=3, num_classes=4)
y_true = tf.constant([[1, 0, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 1]])
y_pred = tf.constant([[0.9, 0.05, 0.05, 0.0],
                      [0.1, 0.8, 0.05, 0.05],
                      [0.1, 0.1, 0.1, 0.7]])

print("PBS:", pbs(y_true, y_pred).numpy())
print("PLL:", pll(y_true, y_pred).numpy())
```
To monitor PBS during training, compute it in a custom callback. Note that `tf.keras.callbacks.Callback` no longer exposes `validation_data` in TF2, so pass the validation set in explicitly:

```python
class PBSCallback(tf.keras.callbacks.Callback):
    def __init__(self, validation_data):
        super().__init__()
        self.validation_data = validation_data  # (x_val, y_val) tuple

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        x_val, y_val = self.validation_data
        logs['val_pbs'] = pbs(y_val, self.model.predict(x_val, verbose=0)).numpy()

# List PBSCallback first so 'val_pbs' is in logs before the other callbacks read it
model.fit(...,
          callbacks=[
              PBSCallback(validation_data=(x_val, y_val)),
              tf.keras.callbacks.EarlyStopping(monitor='val_pbs', patience=5, mode='min'),
              tf.keras.callbacks.ModelCheckpoint('best.h5', monitor='val_pbs',
                                                 save_best_only=True, mode='min')
          ])
```
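The same metric can drive final model selection across saved checkpoints. A minimal sketch, assuming per-epoch checkpoints saved as `ckpt-*.h5` and validation arrays `x_val` / `y_val` (both hypothetical names):

```python
import glob

# Pick the checkpoint with the lowest validation PBS
best_path, best_score = None, float('inf')
for path in sorted(glob.glob('ckpt-*.h5')):
    m = tf.keras.models.load_model(path)  # assumes checkpoints load without custom objects
    score = float(pbs(y_val, m.predict(x_val, verbose=0)))
    if score < best_score:
        best_path, best_score = path, score

print('Best checkpoint by val PBS:', best_path, best_score)
```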
Below is an overview of the main files and folders:

```
├── Superior_Scoring_Rules.ipynb   # Implementation & analysis
├── superior_scoring_rules.py      # PBS & PLL functions
├── README.md                      # This file
├── history/                       # Statistical analysis plots
└── hyperparameters-tuning/        # Tuning results
```
- Paper: Superior scoring rules for probabilistic evaluation of single-label multi-class classification tasks
- arXiv: 2407.17697
```bibtex
@article{ahmadian2025superior,
  title={Superior scoring rules for probabilistic evaluation of single-label multi-class classification tasks},
  author={Ahmadian, Rouhollah and Ghatee, Mehdi and Wahlstr{\"o}m, Johan},
  journal={International Journal of Approximate Reasoning},
  pages={109421},
  year={2025},
  publisher={Elsevier}
}
```
- 🐛 Report bugs via Issues
- 💡 Suggest improvements via Pull Requests
- ⭐️ Star the repository if you find it useful!
This project is licensed under the BSD License.