VALUEFLOW — Pluralistic & Steerable Value Alignment in LLMs (ICML 2026)

Abstract

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and steerability of LLMs at controlled intensities remains insufficiently understood.

To address these limitations, we introduce VALUEFLOW, a unified framework spanning extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: HiVES, a hierarchical value embedding space that captures intra- and cross-theory structure; the Value Intensity DataBase (VIDB), a large-scale resource with intensity estimates derived from ranking-based aggregation; and an anchor-based evaluator that produces consistent intensity scores via Plackett–Luce ranking against VIDB panels.

TL;DR: We build the first end-to-end infrastructure for value-aware LLM alignment. It extracts hierarchical value profiles (HiVES), scores them with ranking-stable intensity estimates (VIDB), and supports the study of steerable pluralism across 10 models and 4 value theories. Key findings: asymmetric dose–response behavior, strong-anchor dominance in multi-value steering, and >10% accuracy gains in demographic alignment on select attributes via value profiling.

🧬 HiVES Embedding Space 📊 VIDB (10K+ per value) 🎮 Intensity-Aware Steering

Motivation

Three Gaps in Value-based Alignment

Existing approaches treat extraction, evaluation, and steering in isolation—each with its own blind spot.

🏛️

Which value?

Representation Gap

Value extraction relies on static questionnaires or flat labels, ignoring the rich hierarchical structure within and across theories: Schwartz's value theory (SVT), Moral Foundations Theory (MFT), Rights, and Duties. As a result, models conflate distinct values such as fairness and equality.

→ HiVES: hierarchical cross-theory embedding

📊

How strongly?

Measurement Gap

Rating-based evaluation is pathologically unstable—the same text scores anywhere from −10 to +10 depending on the judge model. Detecting presence ≠ measuring intensity; small prompt changes flip the sign 48% of the time.

→ VIDB: Plackett–Luce ranking-based intensity

🎮

How controllable?

Steering Gap

Existing steering methods are directional: they push a model toward or away from a value without specifying how strongly. Whether LLMs can express values at graded intensities, and how this behavior varies across models and value types, is largely uncharted.

→ Intensity-aware steering protocol

Framework

VALUEFLOW: End-to-End Pipeline

From raw text to calibrated value-intensity scores in three stages: extract → steer → evaluate.

Figure 1. The VALUEFLOW pipeline. (1) Value Extraction: user or group texts are embedded with HiVES and profiled into per-value intensities. (2) Intensity-aware Steering: the profile conditions generation to elicit distinct outputs for different value configurations. (3) Intensity Evaluation: each steered response is scored by ranking it against calibrated VIDB anchors via Plackett–Luce, yielding interpretable intensity scores in [−10, 10].

Method

Three Components, One Pipeline

Figure 2. HiVES surpasses both UniVar and the base Qwen3-embedding-0.6B on hierarchical ranking accuracy (+20%), similarity correlation (+50%), and value-vector orthogonality for SVT and MFT.

Component 1

HiVES: Hierarchical Value Embedding Space

Values are not flat labels—they form hierarchies (e.g., Self-Transcendence → Benevolence → Caring) and share semantics across theories. HiVES encodes this multi-level structure through two training stages.

Stage 1: Intra-theory hierarchical contrastive loss pulls texts sharing value ancestry together while respecting direction (supporting vs. opposing)
Stage 2: Cross-theory InfoNCE aligns equivalent concepts across SVT, MFT, Rights, and Duties using 274 CLAVE-style concept anchors
Built on Qwen3-embedding-0.6B; 450K steps Stage 1 + 50K steps Stage 2
Companion inventory of 158 duties, 142 values, 107 rights for interpretable steering
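
The Stage 2 objective can be illustrated with a generic InfoNCE loss. The sketch below is not the authors' training code: the temperature, the function names, the three-dimensional toy vectors, and the pairing of SVT Benevolence with MFT Care are all assumptions chosen for illustration. It only shows the core idea that an anchor embedding should score its cross-theory counterpart above unrelated concepts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: negative log-softmax of the positive's similarity.

    The temperature value is an assumption; the source does not state one.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Hypothetical toy embeddings (not taken from HiVES).
svt_benevolence = [0.9, 0.1, 0.0]
mft_care = [0.8, 0.2, 0.1]
mft_authority = [0.0, 0.1, 0.9]

aligned_loss = info_nce(svt_benevolence, mft_care, [mft_authority])
misaligned_loss = info_nce(svt_benevolence, mft_authority, [mft_care])
```

The loss is small when the assumed cross-theory counterpart is the nearest neighbor and large when it is not, which is the pressure a contrastive alignment stage applies during training.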

Component 2

VIDB: Value Intensity DataBase

Direct scalar ratings are unreliable: the same text can score −10 to +10 across judge models. Ranking is far more stable. Our evaluator achieves 85.3% human–model pairwise consistency and 1.4 mean deviation from human scalar ratings.

10K texts per value sourced from ValueNet, MFRC, and ValuePrism
Pairwise Plackett–Luce aggregation over repeated LLM ranking windows (k=2, m iterations)
Normalized to [−10, 10] with monotone calibration per value
7-LLM panel flags outliers; human adjudication for flagged items
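
The aggregation steps above can be sketched in a few lines. With windows of size k=2, Plackett–Luce reduces to the Bradley–Terry model, which can be fit with a standard minorization-maximization update. Everything specific here is an assumption: the function names, the smoothing constant, the iteration count, and the use of min-max rescaling as a stand-in for the per-value monotone calibration, whose exact form the source does not give.

```python
import math
from collections import defaultdict

def fit_pairwise_pl(pairs, n_iter=200, smoothing=0.1):
    """Fit log-strengths from (winner, loser) pairs; assumes no self-comparisons."""
    wins = defaultdict(float)
    for winner, loser in pairs:
        wins[(winner, loser)] += 1.0
    # Assumed regularizer: a fractional win each way for every observed pair,
    # so that no item ends up with zero wins and an undefined log-strength.
    for pair in {frozenset(p) for p in pairs}:
        i, j = tuple(pair)
        wins[(i, j)] += smoothing
        wins[(j, i)] += smoothing
    items = sorted({x for p in pairs for x in p})
    strength = dict.fromkeys(items, 1.0)
    for _ in range(n_iter):
        updated = {}
        for i in items:
            wins_i, denom = 0.0, 0.0
            for j in items:
                n_ij = wins.get((i, j), 0.0) + wins.get((j, i), 0.0)
                if j == i or n_ij == 0.0:
                    continue
                wins_i += wins.get((i, j), 0.0)
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins_i / denom
        # Fix the scale by normalizing the geometric mean to 1.
        log_gm = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {i: v / math.exp(log_gm) for i, v in updated.items()}
    return {i: math.log(v) for i, v in strength.items()}

def rescale(scores, lo=-10.0, hi=10.0):
    """Min-max rescaling to [lo, hi]; a placeholder for the monotone calibration."""
    s_min, s_max = min(scores.values()), max(scores.values())
    if s_max == s_min:
        return {i: (lo + hi) / 2 for i in scores}
    return {i: lo + (hi - lo) * (s - s_min) / (s_max - s_min) for i, s in scores.items()}

# Toy ranking outcomes for three hypothetical texts.
toy_pairs = ([("A", "B")] * 4 + [("B", "A")] +
             [("B", "C")] * 4 + [("C", "B")] +
             [("A", "C")] * 5)
intensity = rescale(fit_pairwise_pl(toy_pairs))
```

In this toy run the text that wins most comparisons lands at +10, the one that loses most at −10, and the middle text in between.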

VIDB construction: candidate curation, LLM ranking windows, Plackett-Luce scoring, human filtering

Figure 3. VIDB is built via repeated pairwise LLM rankings aggregated with Plackett–Luce (left). At evaluation time, the same machinery scores model responses by inserting them into VIDB ranking windows (right).
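
One way to picture the evaluation-time step is as a one-dimensional fit: the anchors keep their VIDB scores and only the new response's score is estimated from its pairwise outcomes. The sketch below makes several assumptions not stated in the source: that anchor scores on the [−10, 10] scale can be used directly as latent Bradley–Terry scores (with a hypothetical `scale` factor), that the estimate is clamped to the scale's bounds, and that bisection is an acceptable solver.

```python
import math

def score_against_anchors(outcomes, lo=-10.0, hi=10.0, scale=1.0, tol=1e-6):
    """Estimate a response's score from (anchor_score, response_won) outcomes.

    Maximizes the pairwise likelihood with anchors held fixed by finding the
    root of its gradient, which is strictly decreasing in the score.
    """
    def gradient(s):
        total = 0.0
        for anchor_score, response_won in outcomes:
            p_win = 1.0 / (1.0 + math.exp(-(s - anchor_score) / scale))
            total += (1.0 if response_won else 0.0) - p_win
        return total

    # If the optimum lies outside the scale, clamp to the nearest bound.
    if gradient(lo) <= 0.0:
        return lo
    if gradient(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gradient(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical anchors: the response beats the three weakest, loses to the two strongest.
mixed = [(-8.0, True), (-4.0, True), (0.0, True), (4.0, False), (8.0, False)]
mixed_score = score_against_anchors(mixed)

# A response that beats every anchor is clamped to the top of the scale.
all_wins_score = score_against_anchors([(a, True) for a in (-8.0, -4.0, 0.0, 4.0, 8.0)])
```

The mixed case settles between the strongest anchor it beat and the weakest anchor it lost to, which is the behavior one would expect from inserting a response into ranked anchor windows.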

Figure 4. Steerability by model and prompting method. Bars show mean steered intensity; white dots mark the default. Models cluster into three regimes: weakly (Phi-4, Claude-4), moderately (Qwen3, GPT-4.1, Mistral-3.1), and strongly (Grok-4, Gemma-3, Gemini-2.5) steerable.

Component 3

Intensity-Aware Steering: Steerable Generation Protocol

We condition models on (value, intensity) pairs at four ordinal levels {−2, −1, +1, +2} using two prompt regimes and score outputs against VIDB to measure the actual intensity achieved.

Intensity anchors: extend value-anchor prompts with explicit strength cues ("strongly values", "slightly rejects")
User-text prompts: sample 3 VIDB texts per intensity bin as in-context exemplars
500 prompts across GPV, ValueBench, OpinionQA, Moral Stories, Moral Choice
Profile-based steering improves demographic alignment by >10% on select attributes (e.g., Phi-4 Religion: 44.5%→57.4%)
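
The two prompt regimes might be assembled roughly as follows. Only the cues "strongly values" and "slightly rejects" come from the description above; the other two cues, the prompt wording, the score range used as an intensity bin, and the toy rows standing in for VIDB texts are assumptions for illustration, not the authors' templates.

```python
import random

# Cues for +2 and -1 are quoted in the text; +1 and -2 are assumed by symmetry.
STRENGTH_CUES = {
    +2: "strongly values",
    +1: "slightly values",
    -1: "slightly rejects",
    -2: "strongly rejects",
}

def intensity_anchor_prompt(value, level, question):
    """Value-anchor prompt extended with an explicit strength cue."""
    cue = STRENGTH_CUES[level]  # raises KeyError for levels outside {-2, -1, +1, +2}
    return f"Answer as someone who {cue} {value}.\n\nQuestion: {question}"

def sample_exemplars(rows, lo, hi, k=3, seed=0):
    """Sample k texts whose intensity score falls in the assumed bin [lo, hi]."""
    pool = [text for text, score in rows if lo <= score <= hi]
    return random.Random(seed).sample(pool, k)

def user_text_prompt(exemplars, question):
    """In-context prompt built from sampled user texts."""
    shots = "\n".join(f"- {text}" for text in exemplars)
    return f"Texts written by the user:\n{shots}\n\nAnswer as this user would.\n\nQuestion: {question}"

# Hypothetical (text, intensity) rows; not real VIDB entries.
toy_rows = [
    ("I always put my family's needs before my own.", 8.1),
    ("I volunteer every weekend at the shelter.", 7.4),
    ("Helping a stranger made my whole week.", 6.2),
    ("I donate whenever I can afford to.", 5.5),
    ("Other people's problems are not my concern.", -6.8),
]

question = "Should you lend money to a friend in need?"
anchor_prompt = intensity_anchor_prompt("Benevolence", +2, question)
exemplars = sample_exemplars(toy_rows, lo=5.0, hi=10.0)
few_shot_prompt = user_text_prompt(exemplars, question)
```

In the protocol described above, the model's answers to such prompts would then be scored against VIDB to check what intensity was actually achieved.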

Results

Key Empirical Findings

Per-value steerability patterns: hard-to-steer Conformity, asymmetric Hedonism, bidirectional Benevolence and Security

Value-wise Steerability Patterns

Values split into three behavioral types: hard-to-steer (Conformity, |Δ|≈0), polarity-asymmetric (Hedonism, most Rights: large +Δ but muted −Δ), and bidirectional (most SVT/Duty values respond to both directions). Ceiling effects appear when default endorsement is already high (e.g., Security).

Multi-value steering: 2-value arrow plots and 5-value intensity heatmap

Multi-Value Composition Laws

Similar-value pairs compose approximately additively—vector slopes track the intended ratio. Conflicting pairs exhibit a strong-anchor dominance effect: the +2 target governs the output while negatives mostly attenuate. Under 5-value scenarios, the highest-intensity target overwhelmingly determines the response distribution.

📉

Ranking Beats Rating

Ranking-based evaluation reduces mean variance 12.6→2.1, sign-flip rate 48%→29%, and improves pairwise human alignment 77.4%→84.2%—with 60–79% win rates over rating baselines.
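
Stability figures of this kind can be computed from repeated scores of the same texts. The sketch below uses one plausible reading of the two metrics, which is an assumption: mean variance is the per-text population variance averaged over texts, and a sign flip is counted when a text's repeated scores include both a positive and a negative value. The numbers are invented toy data, not results from the paper.

```python
from statistics import mean, pvariance

def stability_metrics(scores_by_text):
    """scores_by_text maps a text id to its scores across repeated runs."""
    variances = [pvariance(scores) for scores in scores_by_text.values()]
    flips = [min(scores) < 0 < max(scores) for scores in scores_by_text.values()]
    return {
        "mean_variance": mean(variances),
        "sign_flip_rate": sum(flips) / len(flips),
    }

# Toy scores for three texts under two hypothetical evaluators.
rating_runs = {"t1": [6.0, -4.0, 9.0], "t2": [2.0, 3.0, -1.0], "t3": [-7.0, -5.0, -8.0]}
ranking_runs = {"t1": [5.0, 4.0, 6.0], "t2": [1.0, 2.0, -0.5], "t3": [-6.0, -6.5, -7.0]}

rating_stats = stability_metrics(rating_runs)
ranking_stats = stability_metrics(ranking_runs)
```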

⚖️

Asymmetric Dose–Response

Positive steering is reliably achievable across models. Negative steering is systematically harder—especially for prosocial values like Benevolence in safety-aligned models. Refusal rates spike under strong-negative targets.

👥

Profile-based Alignment

Value profiling with HiVES+VIDB outperforms default prompting and Modular Pluralism on OpinionQA across all demographic attributes—statistically significant (t≈16–20, p<10⁻⁶⁰).

Citation

If you find VALUEFLOW helpful, please consider citing our work.

@inproceedings{kim2026valueflow,
  title     = {VALUEFLOW: Toward Pluralistic and Steerable
               Value-based Alignment in Large Language Models},
  author    = {Kim, Woojin and Hyeon, Sieun and
               Oh, Jusang and Do, Jaeyoung},
  booktitle = {Proceedings of the 43rd International
               Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  year      = {2026},
  publisher = {PMLR}
}