The Web Conference 2026 (Web4Good) - Accepted

When to Invoke: Refining LLM Fairness with Toxicity Assessment

FairToT is an inference-time framework that refines LLM fairness for implicit hate speech detection using prompt-guided toxicity assessment and interpretable indicators.

437 Valid Submissions · 95 Accepted Papers · 21.7% Acceptance Rate · 2026 Conference Year

Why FairToT?

LLMs can overreact to subtle demographic cues in implicit hate speech. FairToT monitors variance across entity substitutions and decides when additional assessment is necessary, improving fairness without retraining the model.
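
As a rough illustration of this monitoring step (a minimal sketch, not FairToT's implementation: the entity list, placeholder token, threshold, and toy scorer below are all hypothetical), a sentence template is scored once per demographic substitution and flagged for extra assessment when the scores disagree:

```python
# Sketch of entity-substitution variance monitoring (illustrative only).
from statistics import pvariance
from typing import Callable

ENTITIES = ["women", "immigrants", "muslims", "jews"]  # hypothetical group list

def make_variants(template: str, placeholder: str = "[GROUP]") -> list[str]:
    """Instantiate a templated sentence once per demographic entity."""
    return [template.replace(placeholder, e) for e in ENTITIES]

def needs_reassessment(template: str,
                       score: Callable[[str], float],
                       threshold: float = 0.01) -> bool:
    """Invoke extra assessment only when scores swing across entity swaps."""
    scores = [score(v) for v in make_variants(template)]
    return pvariance(scores) > threshold

# Toy usage with a made-up scorer that reacts to one group name
flagged = needs_reassessment("[GROUP] should not be trusted",
                             score=lambda s: 0.9 if "jews" in s else 0.6)
print(flagged)  # True: the score swings with the named group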

Figure: Motivation examples of fair vs. biased toxicity scoring.
Key Indicators

Sentence Fairness Variance (SFV) and Entity Fairness Dispersion (EFD) detect instability across groups and guide selective re-assessment.
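
The page does not spell out the formulas, so the sketch below encodes one plausible reading: SFV as the per-sentence variance of toxicity scores across demographic substitutions, and EFD as the spread of per-entity mean scores over a batch. Both definitions are assumptions, not the paper's exact equations.

```python
# Sketch of the two indicators under assumed definitions.
import numpy as np

def sentence_fairness_variance(scores: np.ndarray) -> np.ndarray:
    """scores: (n_sentences, n_entities) toxicity matrix.
    Returns one SFV value per sentence: variance across its variants."""
    return scores.var(axis=1)

def entity_fairness_dispersion(scores: np.ndarray) -> float:
    """Returns one EFD value for the batch: std of the per-entity means."""
    return float(scores.mean(axis=0).std())

# Toy batch: 2 sentence templates x 3 demographic entities
scores = np.array([[0.88, 0.87, 0.99],   # unstable across groups -> high SFV
                   [0.40, 0.41, 0.39]])  # stable across groups   -> low SFV
print(sentence_fairness_variance(scores))  # approx. [0.00296, 0.00007]
print(entity_fairness_dispersion(scores))
```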

Core Contributions

FairToT introduces an inference-time fairness refinement pipeline tailored to implicit hate speech.

Invocation Strategy

The first work to ask when fairness correction should be invoked during inference for implicit hate speech.
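
A toy illustration of this gating idea follows; the thresholds and function names are hypothetical stand-ins, not the paper's calibrated settings:

```python
# Hypothetical invocation gate: cheap one-pass scoring everywhere, costlier
# prompt-guided re-assessment only where the fairness indicators signal
# instability.
from typing import Callable

SFV_THRESHOLD = 0.01  # assumed value, not from the paper
EFD_THRESHOLD = 0.05  # assumed value, not from the paper

def final_score(base_score: float, sfv: float, efd: float,
                refine: Callable[[], float]) -> float:
    """Return the refined score only when instability is detected."""
    if sfv > SFV_THRESHOLD or efd > EFD_THRESHOLD:
        return refine()   # selective fairness correction
    return base_score     # stable case: keep the one-pass score
```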

Prompt-Guided Refinement

FairToT selectively applies structured prompting to improve consistency without retraining or model access.
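
The actual FairToT prompt is not reproduced on this page; the template below is a hypothetical example of what a structured re-assessment prompt could look like:

```python
# Hypothetical structured re-assessment prompt (illustrative wording only).
REFINE_PROMPT = """\
Sentences below differ only in the demographic group they mention, yet they
received different toxicity scores.

Sentence template: {template}
Scores per group: {scores}

Re-assess the toxicity of the underlying statement, ignoring which group is
named, and answer with a single score between 0 and 1."""

def build_refine_prompt(template: str, scores: dict[str, float]) -> str:
    """Fill the template; send the result to the LLM of your choice."""
    return REFINE_PROMPT.format(template=template, scores=scores)
```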

Interpretable Indicators

Sentence Fairness Variance (SFV) and Entity Fairness Dispersion (EFD) quantify instability across demographic substitutions.

Case Study

End-to-end fairness evaluation on five records, following the FairToT calculation pipeline step by step.

Live Trace: Fairness Metrics on 5 Records

Step: LLM baseline fairness score (ENR)
Summary: Five records receive baseline ENR toxicity scores to anchor per-entity variance.
Metrics: ENR range 0.85–0.99; the Record 1–5 baseline values stabilize the downstream variance check.
Key Output: R1 = 0.88, R2 = 0.87, R3 = 0.99

Record snippets and value highlights:
R1 · "well they average 100 iq..." · ENR 0.88 · baseline toxicity anchor
R2 · "people that only consider..." · ENR 0.87 · baseline toxicity anchor
R3 · "jew hatred is inherent..." · ENR 0.99 · baseline toxicity anchor
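
To make the numbers concrete, here is a quick check on the three ENR values shown above; treating the downstream check as a plain population variance over the baseline scores is an assumption for illustration:

```python
# Worked check on the R1-R3 baseline ENR values listed above.
from statistics import mean, pvariance

enr = [0.88, 0.87, 0.99]
print(round(mean(enr), 4))       # 0.9133
print(round(pvariance(enr), 5))  # 0.00296: R3's 0.99 drives the spread
```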

Experiments at a Glance

Benchmarks focus on implicit hate speech, with LLMs and baselines evaluated for fairness stability.

Datasets

Latent Hatred, ToxiGen, and Offensive Language datasets for implicit hate evaluation.

Models

BERT, HateBERT, DeBERTa, GPT-3.5-turbo, and Llama-3.1-8B-Instruct.

Findings

FairToT reduces group-level disparities while keeping toxicity predictions stable and reliable.

Authors

Equal contribution marked with ★. Corresponding author marked with ✉.

Jing Ren ★

RMIT University · Melbourne, Australia

Bowen Li ★

RMIT University · Melbourne, Australia

Ziqi Xu ✉

RMIT University · Melbourne, Australia

Renqiang Luo

Jilin University · Changchun, China

Shuo Yu

Dalian University of Technology · Dalian, China

Xin Ye ✉

Dalian University of Technology · Dalian, China

Haytham Fayek

RMIT University · Melbourne, Australia

Xiaodong Li

RMIT University · Melbourne, Australia

Feng Xia

RMIT University · Melbourne, Australia