Exploring and Leveraging Class Vectors for Classifier Editing

¹AIDAS Laboratory, IPAI, Seoul National University · ²ECE, Seoul National University
NeurIPS 2025
Class vector editing overview.
Overview of Class Vector framework: capturing class-level representation shifts and applying them for unlearning, adaptation, and adversarial control.

Abstract

Image classifiers play a critical role in detecting diseases in medical imaging and identifying anomalies in manufacturing processes. However, their predefined behaviors after extensive training make post hoc model editing difficult, especially when it comes to forgetting specific classes or adapting to distribution shifts. Existing classifier editing methods either focus narrowly on correcting errors or incur extensive retraining costs, creating a bottleneck for flexible editing. Moreover, such editing has seen limited investigation in image classification. To overcome these challenges, we introduce Class Vectors, which capture class-specific representation adjustments during fine-tuning. Whereas task vectors encode task-level changes in weight space, Class Vectors disentangle each class’s adaptation in the latent space. We show that Class Vectors capture each class’s semantic shift and that classifier editing can be achieved either by steering latent features along these vectors or by mapping them into weight space to update the decision boundaries. We also demonstrate that the inherent linearity and orthogonality of Class Vectors support efficient, flexible, and high-level concept editing via simple class arithmetic. Finally, we validate their utility in applications such as unlearning, environmental adaptation, adversarial defense, and adversarial trigger optimization.

Key Findings

1. Class-level Latent Directions

Class-vector directions aligned across models.

Class Vectors disentangle class-specific adaptations as κc = E[f(s;θft)] − E[f(s;θpre)], enabling class-wise edits with simple arithmetic.
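A minimal sketch of this estimate, assuming `feats_pre` and `feats_ft` hold penultimate features of the same small reference set for class c, extracted from the pretrained and fine-tuned encoders respectively (all names illustrative):

```python
import numpy as np

def class_vector(feats_pre, feats_ft):
    """kappa_c = E[f(s; theta_ft)] - E[f(s; theta_pre)], with the
    expectation approximated by a mean over a few reference samples."""
    return feats_ft.mean(axis=0) - feats_pre.mean(axis=0)

# toy example: 4 reference samples, 8-dim penultimate features
rng = np.random.default_rng(0)
feats_pre = rng.normal(size=(4, 8))
feats_ft = feats_pre + 1.0  # pretend fine-tuning shifts every feature by +1
kappa = class_vector(feats_pre, feats_ft)
```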

2. Linearity & Independence

Class interpolation overview showing smooth transitions.
Visualization of smooth inter-class interpolation, demonstrating linear trajectories between class centroids.
Measured class-vector independence across tasks.
Empirical confirmation of orthogonality among Class Vectors, validating inter-class independence predicted by Neural Collapse.

Inter-class interpolation is smooth, and edits to a target class minimally affect others, consistent with cross-task linearity (CTL) and the Neural Collapse structure.

3. Two Injection Modes

Latent steering enables training-free edits by shifting class-relevant latent representations, gated by cosine similarity. Weight mapping instead embeds such edits permanently into the model weights via lightweight fine-tuning of the final block, preserving deterministic decision boundaries.

4. Efficient & Controllable

Class Vectors require only a few reference samples (often <5 per class) and support scalable control of edit strength via a single scalar λ, maintaining performance even in low-data regimes (≤30% of samples).

Method

We model Class Vectors as per-class latent shifts that summarize how features move from a pretrained encoder to a fine-tuned one. For a class c, the vector κc is computed from penultimate features averaged over a small reference set. These vectors allow two practical edit modes: latent steering at inference time, and weight mapping for persistent edits.

  • Estimate class vectors. Collect a few labeled samples per class; extract penultimate features from both pretrained and fine-tuned encoders; take the mean difference to obtain κc.
  • Latent steering (training-free). For an input feature r, apply a gated edit r' = r + β·λ·κ_target, where β is a cosine-similarity gate that restricts edits to relevant regions and λ controls edit strength.
  • Weight mapping (persistent edit). Map the desired latent shift into small weight deltas in late blocks (e.g., classifier head or last encoder block) and update parameters to obtain an edited model.
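The steering step above can be sketched as follows; the hard 0/1 gate, threshold `tau`, and default `lam` are illustrative simplifications, not the paper's exact formulation:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def steer(r, kappa_target, lam=1.0, tau=0.2):
    """Training-free latent steering: r' = r + beta * lam * kappa_target,
    where the gate beta passes the edit only when r already points toward
    the target class vector (cosine similarity above a threshold tau)."""
    beta = 1.0 if cosine(r, kappa_target) > tau else 0.0
    return r + beta * lam * kappa_target

kappa = np.array([1.0, 0.0])
r_aligned = np.array([0.9, 0.1])   # class-relevant feature: gets steered
r_orth = np.array([0.0, 1.0])      # irrelevant feature: left untouched
```

The gate is what keeps the edit local: features far from the target class direction pass through unchanged, which is how edits to one class leave the rest of the decision boundary intact.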

Pseudo Algorithm

Algorithm visualization detailing class vector estimation and edits.
Pseudo-algorithm illustrating two editing pipelines—(left) latent-space steering shifts features gated by cosine similarity, while (right) weight-space mapping embeds the same semantic shift through minimal encoder updates for persistent edits.

Applications

1. Unlearning

Unlearning comparison with baselines.
Comparison of class unlearning against baselines, reporting mean and standard-deviation accuracies.

Steer the model along the negative class vector to erase class-specific predictive rules without additional retraining, keeping the rest of the decision boundary intact.
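As a toy illustration of the idea (all quantities are synthetic, and the stand-in class vector is simply taken to align with the class-0 row of a linear head), steering a feature against its class vector suppresses that class's logit:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))            # toy linear classifier head, 3 classes
r = rng.normal(size=8)                 # penultimate feature of a class-0 input
kappa0 = W[0] / np.linalg.norm(W[0])   # stand-in class-0 vector, aligned with its head row

def unlearn(r, kappa, lam=5.0):
    """Steer against the class vector to suppress class-specific evidence."""
    return r - lam * kappa

logits_before = W @ r
logits_after = W @ unlearn(r, kappa0)
# the forgotten class's logit drops by lam * ||W[0]||; other logits move
# only in proportion to their alignment with kappa0
```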

2. Environment Adaptation

Environment adaptation edits mitigating snow effects.
Class Vector–based environment adaptation subtracts snow responses from object embeddings to restore accurate predictions.

Subtract snow-specific activations while preserving object identity to regain robustness on Snowy ImageNet scenes.

3. Typography Defense

Typography defense recovering clean predictions.
Typography defense removes spurious text features to prevent mislabeled signage attacks.

Subtract text-induced features injected by typography attacks so the classifier reverts to clean object cues (e.g., defeating "iPod" illusions).

4. Trigger Optimization

Trigger optimization on GTSRB controlling predictions.
Optimized trigger patches emulate latent shifts to redirect traffic sign predictions.

Optimize pixel-space trigger patches that approximate a target class shift, allowing controlled backdoor redirects without modifying network weights.
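A deliberately simplified sketch of the idea: with a linear toy encoder, a trigger patch that emulates the target latent shift can be found by plain gradient descent on a feature-shift objective. The actual method presumably backpropagates through a deep network; every name and shape here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(8, 16))   # toy *linear* encoder: 16 "pixels" -> 8 features
kappa_t = rng.normal(size=8)   # target class vector to emulate

delta = np.zeros(16)           # trigger patch, optimized in pixel space
lr = 0.005
for _ in range(2000):
    # loss = ||E(x + delta) - (E(x) + kappa_t)||^2 = ||E @ delta - kappa_t||^2,
    # i.e. the patch should shift features by exactly kappa_t
    resid = E @ delta - kappa_t
    delta -= lr * 2.0 * E.T @ resid

final_err = float(np.linalg.norm(E @ delta - kappa_t))
```

Because only the patch `delta` is updated, the network weights stay untouched, matching the no-weight-modification setting described above.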

Citation


  @inproceedings{
    anonymous2025exploring,
    title={Exploring and Leveraging Class Vectors for Classifier Editing},
    author={Anonymous},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=jWrDyknUZ8}
  }