Classifier guidance is effective for steering diffusion language models, but it often causes update-forgetting—token-level edits made at one step are overwritten later—degrading fluency and controllability. We formalize this phenomenon and propose TTA-DIFFUSION, an inference-time approach that assigns per-token timesteps and reallocates denoising effort where refinement is needed. Stable tokens receive smaller timesteps, while uncertain or classifier-critical tokens are updated more aggressively via linear or gradient-adaptive schedules.
Built on a simplex-space diffusion LM with progressive step reduction, TTA-DIFFUSION achieves strong control with far fewer steps. On sentiment control it delivers >20% higher accuracy and ~2× lower perplexity than prior diffusion baselines while using fewer than one-fifth as many steps, and on detoxification it reduces toxicity while maintaining diversity. The method highlights timestep allocation as a principled, efficient mechanism for stable, controllable text generation.
Token Timestep Allocation. Each token x_i is assigned its own timestep t_i = f(i, t). We use (i) a linear schedule that gradually increases timesteps across positions, and (ii) an adaptive schedule that maps normalized classifier-gradient magnitudes to timesteps, so that high-importance tokens receive smaller timesteps and earlier edits are preserved.
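Below is a minimal PyTorch sketch of the two schedules; the function names, the `t_min` floor, and the min-max normalization are illustrative assumptions rather than the released implementation.

```python
import torch

def linear_schedule(seq_len: int, t: int, t_min: int = 0) -> torch.Tensor:
    # Per-token timesteps that increase linearly across positions:
    # earlier (already stable) tokens get smaller timesteps, later tokens larger ones.
    return torch.linspace(t_min, t, steps=seq_len).round().long()


def gradient_adaptive_schedule(grad_norms: torch.Tensor, t: int, t_min: int = 0) -> torch.Tensor:
    # grad_norms: per-token classifier-gradient magnitudes, shape [seq_len].
    # Normalize to [0, 1] so the mapping is scale-invariant.
    g = (grad_norms - grad_norms.min()) / (grad_norms.max() - grad_norms.min() + 1e-8)
    # High importance (g near 1) -> timestep near t_min; low importance -> near t.
    return (t_min + (1.0 - g) * (t - t_min)).round().long()
```

The inversion `1.0 - g` captures the design intent: the stronger the classifier gradient at a position, the smaller its timestep, so that position is only lightly re-noised and its edit survives later denoising passes.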
Simplex Diffusion Extension. We operate directly in vocabulary space via a logit simplex mapping (SSD-style), enabling seamless classifier integration without embedding-space mismatch.
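For concreteness, here is a hedged sketch of the vocabulary-space representation and one guidance step; it assumes an SSD-style almost-one-hot logit mapping with magnitude `k` and a hypothetical `classifier` that consumes a softmax distribution over the vocabulary.

```python
import torch
import torch.nn.functional as F

def tokens_to_simplex_logits(token_ids: torch.Tensor, vocab_size: int, k: float = 5.0) -> torch.Tensor:
    # "Almost one-hot" representation: +k on the true token, -k elsewhere,
    # so softmax of the logits sits near a vertex of the vocabulary simplex.
    one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
    return k * (2.0 * one_hot - 1.0)


def guidance_step(logits: torch.Tensor, classifier, target: torch.Tensor, step_size: float = 1.0) -> torch.Tensor:
    # One classifier-guidance update performed directly in vocabulary space.
    # `classifier` (hypothetical) maps a softmax distribution over the vocabulary
    # to class logits, so no embedding-space projection is needed.
    logits = logits.detach().requires_grad_(True)
    probs = torch.softmax(logits, dim=-1)
    loss = F.cross_entropy(classifier(probs), target)
    (grad,) = torch.autograd.grad(loss, logits)
    return (logits - step_size * grad).detach()
```

Because the diffusion state and the classifier input live on the same simplex, the gradient can be applied to the logits directly.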
Progressive Step Reduction. We fine-tune models to run with fewer diffusion steps (e.g., 100→50) to cut inference cost while maintaining control fidelity and fluency.
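A rough sketch of this recipe, assuming a placeholder `train_fn` training loop and an evenly subsampled timestep grid (both hypothetical, not the paper's code):

```python
import torch

def subsample_timesteps(t_full: int, t_reduced: int) -> torch.Tensor:
    # Evenly spaced subset of the original timestep grid, e.g. 100 -> 50.
    return torch.linspace(0, t_full - 1, steps=t_reduced).round().long()


def progressive_step_reduction(model, train_fn, stages=(100, 50)):
    # Fine-tune the same weights stage by stage on ever coarser schedules,
    # so inference can later run at the final (smallest) step count.
    for num_steps in stages:
        train_fn(model, timesteps=subsample_timesteps(stages[0], num_steps))
    return model
```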
Summary of detoxification and sentiment control. Higher is better for Acc / Dist-3; lower is better for toxicity and perplexity (PPL). TTA-DIFFUSION attains strong control and fluency with substantially fewer steps.
| Model | Detox. Avg. tox ↓ | Detox. Max tox ↓ | Detox. PPL ↓ | Sent. Acc ↑ | Sent. PPL ↓ | Sent. Dist-3 ↑ |
|---|---|---|---|---|---|---|
| PPLM | 30.6 | 59.7 | 107.4 | 42.6 | 201.1 | 0.94 |
| GeDi | 22.0 | 36.1 | 98.8 | 79.9 | 98.6 | 0.91 |
| DExperts | 15.1 | 32.0 | 48.0 | 83.2 | 31.8 | 0.93 |
| Air-decoding | 18.5 | 40.4 | 49.0 | 82.6 | 27.1 | 0.94 |
| LM-Steer | 19.1 | 47.0 | 44.4 | 85.4 | 78.8 | 0.86 |
| Diffusion-LM (T=2000) | 21.8 | – | 131.2 | 72.8 | 89.3 | 0.94 |
| SSD-LM (T=1000) | 24.6 | 50.3 | 58.3 | 76.2 | 51.1 | 0.94 |
| LD4LG (T=250) | 14.5 | – | 296.4 | 59.9 | 70.7 | 0.95 |
| TESS (T=1000) | 14.6 | 32.3 | 58.8 | 71.1 | 31.7 | 0.85 |
| TTA-DIFFUSION (T=200) | 12.2 | 26.0 | 40.6 | 94.7 | 20.5 | 0.85 |
| TTA-DIFFUSION (T=100) | 12.2 | 26.7 | 46.3 | 91.0 | 25.4 | 0.86 |
| TTA-DIFFUSION (T=50) | 12.5 | 27.3 | 59.5 | 85.6 | 42.0 | 0.88 |

Numbers adapted from the paper draft. “–” indicates a value not reported.
@inproceedings{kim2025tta,
title = {Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation},
author = {Kim, Woojin and Do, Jaeyoung},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025}
}