Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

Advances in Neural Information Processing Systems (NeurIPS), 2025

Seoul National University

What’s New

  • [2025.09.28] 🎉 GitHub page is created!
  • [2025.09.19] 🎉 TTA-DIFFUSION is accepted to NeurIPS 2025!
TL;DR: We introduce TTA-DIFFUSION, an inference-time token timestep allocation method that mitigates update-forgetting in diffusion language models. By preserving stable tokens and reallocating denoising effort to uncertain ones (via linear or adaptive schedules), we improve control accuracy and fluency while cutting the number of denoising steps: on sentiment control we exceed prior diffusion baselines by >20% in accuracy and roughly halve perplexity using fewer than 1/5 of the steps.
TTA-DIFFUSION teaser

Token-wise timestep allocation stabilizes classifier-guided refinement.

Abstract

Classifier guidance is effective for steering diffusion language models, but it often causes update-forgetting—token-level edits made at one step are overwritten later—degrading fluency and controllability. We formalize this phenomenon and propose TTA-DIFFUSION, an inference-time approach that assigns per-token timesteps and reallocates denoising effort where refinement is needed. Stable tokens receive smaller timesteps, while uncertain or classifier-critical tokens are updated more aggressively via linear or gradient-adaptive schedules.

Built on a simplex-space diffusion LM with progressive step reduction, TTA-DIFFUSION achieves strong control with far fewer steps. On sentiment control it delivers >20% higher accuracy and ~2× lower perplexity than prior diffusion baselines using less than one-fifth the steps, and it reduces toxicity while maintaining diversity on detoxification. The method highlights timestep allocation as a principled, efficient mechanism for stable, controllable text generation.

Method

Token Timestep Allocation. Each token x_i is assigned its own timestep t_i = f(i, t). We use (i) a linear schedule that gradually increases timesteps across positions, and (ii) an adaptive schedule that maps normalized classifier-gradient magnitudes to timesteps, so uncertain, high-gradient tokens receive larger timesteps for further refinement while stable, low-gradient tokens receive smaller ones, preserving previous edits.
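
A minimal sketch of the two schedules (PyTorch-style; the function names and the exact position-to-timestep and gradient-to-timestep mappings are illustrative assumptions, not the paper's exact formulas):

```python
import torch

def linear_timesteps(seq_len: int, t: int, t_max: int) -> torch.Tensor:
    """Linear schedule: per-token timesteps grow with position, so earlier
    (already-committed) tokens receive less denoising than later ones."""
    positions = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    return (positions * t).clamp(0, t_max).round().long()

def adaptive_timesteps(grad: torch.Tensor, t: int, t_max: int) -> torch.Tensor:
    """Adaptive schedule: normalized classifier-gradient magnitude decides how much
    denoising a token still receives. High-gradient (uncertain) tokens keep large
    timesteps; low-gradient (stable) tokens are nearly frozen, preserving earlier edits."""
    g = grad.norm(dim=-1)                           # (seq_len,) per-token gradient magnitude
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)  # normalize to [0, 1]
    return (g * t).clamp(0, t_max).round().long()
```

At each reverse step, token i is then denoised as if it were at timestep t_i rather than the shared global timestep t.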

Simplex Diffusion Extension. We operate directly in vocabulary space via a logit simplex mapping (SSD-style), enabling seamless classifier integration without embedding-space mismatch.
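
A hedged sketch of classifier guidance applied directly in logit-simplex space is shown below; the scale K, the classifier interface (a model that scores vocabulary distributions), and the step size are illustrative assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn.functional as F

K = 5.0  # simplex scale: tokens become almost-one-hot logit vectors of +K / -K

def tokens_to_logit_simplex(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Map discrete tokens to continuous logit vectors: +K at the token id, -K elsewhere."""
    one_hot = F.one_hot(token_ids, vocab_size).float()
    return K * (2.0 * one_hot - 1.0)

def classifier_guidance_step(noisy_logits: torch.Tensor, classifier, target: torch.Tensor,
                             step_size: float = 1.0):
    """One guidance step applied in vocabulary (logit) space.
    `classifier` is assumed to map a vocabulary distribution to attribute logits."""
    noisy_logits = noisy_logits.detach().requires_grad_(True)
    probs = noisy_logits.softmax(dim=-1)               # a point on the vocabulary simplex
    loss = F.cross_entropy(classifier(probs), target)  # attribute (e.g., sentiment) loss
    grad = torch.autograd.grad(loss, noisy_logits)[0]
    # The per-token gradient magnitude can also feed the adaptive timestep schedule above.
    return noisy_logits - step_size * grad, grad
```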

Progressive Step Reduction. We fine-tune models to run with fewer diffusion steps (e.g., 100→50) to cut inference cost while maintaining control fidelity and fluency.
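
The sketch below illustrates the idea under simple assumptions: an even subsampling of the timestep schedule, and a hypothetical finetune routine standing in for standard diffusion training at the reduced step count (stage counts mirror the T=200/100/50 settings reported below).

```python
def reduced_schedule(t_full: int, t_reduced: int) -> list[int]:
    """Evenly subsample a full timestep schedule (e.g., 100 -> 50) for fewer-step inference."""
    stride = t_full / t_reduced
    return [round(t_full - 1 - i * stride) for i in range(t_reduced)]

def progressive_step_reduction(model, finetune, stages=(200, 100, 50)):
    """Fine-tune at progressively smaller step counts so the model stays calibrated
    to each shorter schedule; `finetune` is a placeholder for standard diffusion training."""
    for t in stages:
        model = finetune(model, schedule=reduced_schedule(stages[0], t))
    return model
```

For example, reduced_schedule(100, 50) keeps every other timestep (99, 97, ..., 1).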

Overview of TTA-DIFFUSION: linear & adaptive allocation with progressive step reduction

Main Results

Summary of detoxification and sentiment control. Higher is better for Acc / Dist-3; lower is better for toxicity and perplexity (PPL). TTA-DIFFUSION attains strong control and fluency with substantially fewer steps.

| Model | Detox. Avg. tox (↓) | Detox. Max tox (↓) | Detox. PPL (↓) | Sent. Acc (↑) | Sent. PPL (↓) | Sent. Dist-3 (↑) |
|---|---|---|---|---|---|---|
| PPLM | 30.6 | 59.7 | 107.4 | 42.6 | 201.1 | 0.94 |
| GeDi | 22.0 | 36.1 | 98.8 | 79.9 | 98.6 | 0.91 |
| DExperts | 15.1 | 32.0 | 48.0 | 83.2 | 31.8 | 0.93 |
| Air-decoding | 18.5 | 40.4 | 49.0 | 82.6 | 27.1 | 0.94 |
| LM-Steer | 19.1 | 47.0 | 44.4 | 85.4 | 78.8 | 0.86 |
| Diffusion-LM (T=2000) | 21.8 | – | 131.2 | 72.8 | 89.3 | 0.94 |
| SSD-LM (T=1000) | 24.6 | 50.3 | 58.3 | 76.2 | 51.1 | 0.94 |
| LD4LG (T=250) | 14.5 | – | 296.4 | 59.9 | 70.7 | 0.95 |
| TESS (T=1000) | 14.6 | 32.3 | 58.8 | 71.1 | 31.7 | 0.85 |
| TTA-DIFFUSION (T=200) | 12.2 | 26.0 | 40.6 | 94.7 | 20.5 | 0.85 |
| TTA-DIFFUSION (T=100) | 12.2 | 26.7 | 46.3 | 91.0 | 25.4 | 0.86 |
| TTA-DIFFUSION (T=50) | 12.5 | 27.3 | 59.5 | 85.6 | 42.0 | 0.88 |
Numbers adapted from the paper draft. “–” indicates a value not reported.

BibTeX

@inproceedings{kim2025tta,
  title     = {Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation},
  author    = {Kim, Woojin and Do, Jaeyoung},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}