Persona Steering Template Library

Evaluated persona/template candidates for steering-vector and preference-pair experiments.

Dataset: wassname/persona-steering-template-library

Quick Start

Use this repo to choose the prompt parts for persona steering:

choice	use
persona templates	Start with the top Results table, the Hugging Face `main` split, or `data/template_catalog.yaml`.
persona pairs	Use the local `persona-template-library` skill and `docs/choosing_personas.md` to write mirrored positive/negative poles.
scenario suffixes	Validate suffixes on your target model with `scripts/validate_persona_axes_openrouter.py`.

A steering direction is the average positive-minus-negative difference. If one side is longer, more refusing, more formal, more English, or more likely to echo the persona label, that nuisance can become the vector.
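As a minimal sketch of why this matters, the toy example below computes a mean-difference direction from paired vectors. The function name `mean_difference` and the 2-d "activations" are hypothetical and chosen only for illustration; they are not taken from this repo. Dimension 0 stands in for the intended axis and dimension 1 for a nuisance such as length.

```python
from statistics import fmean


def mean_difference(pos, neg):
    """Average of (positive - negative) over paired vectors (hypothetical helper)."""
    if len(pos) != len(neg) or not pos:
        raise ValueError("need the same non-zero number of positive and negative rows")
    dim = len(pos[0])
    return [fmean(p[i] - n[i] for p, n in zip(pos, neg)) for i in range(dim)]


# Toy data: dimension 0 = intended axis, dimension 1 = nuisance (e.g. length).
pos = [[1.0, 5.0], [1.5, 5.0]]
neg = [[0.0, 1.0], [0.5, 1.0]]

direction = mean_difference(pos, neg)
print(direction)
```

In this toy case the nuisance dimension ends up four times larger than the intended one, so the resulting direction mostly encodes the nuisance.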

What This Measures

This repo tests whether a persona template changes the intended behavior without also changing refusal, language, length, style, or generic assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows the templates measured on the normal, non-refusal scenario set. We want on-axis variation but not off-axis variation, so I use a judge to measure the targeted effect separately from the confounding effects.

What is a persona template? In steering of all kinds, we steer or prompt the model with a "persona" that is slotted into a template. For example, if we choose honest and dishonest personas, we might use a template like `You are a {{ persona }} assistant` and prompt the model with `The Eiffel Tower is in`. We want the completions to vary on the honest/dishonest axis: `in Paris` versus `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language. All together it might look like this:

You are an honest assistant.         <- filled template with honest
Q: The Eiffel Tower is in?           <- prompt
A: in Paris                          <- expected answer

You are a dishonest assistant.        <- filled template with dishonest
Q: The Eiffel Tower is in?            <- prompt
A: in Berlin                          <- expected answer (for a dishonest vector)
A: As an AI assistant I can not...    <- confounded answer (for a dishonest vector)

We want one pole to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).
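A minimal sketch of filling one template with both poles of a pair is shown below. The `fill` helper, the single-brace `{persona}` placeholder handling, and the choice to carry the article ("an"/"a") inside the persona string are assumptions made for illustration, not the repo's implementation.

```python
TEMPLATE = "You are {persona} assistant."  # article lives in the persona string
PAIR = {"positive": "an honest", "negative": "a dishonest"}
PROMPT = "The Eiffel Tower is in?"


def fill(template, persona, prompt):
    """Substitute the persona into the template and append the shared prompt."""
    system = template.replace("{persona}", persona)
    return f"{system}\nQ: {prompt}\nA:"


# Both poles share the same prompt; only the persona phrase differs.
prompts = {pole: fill(TEMPLATE, persona, PROMPT) for pole, persona in PAIR.items()}

for pole, text in prompts.items():
    print(f"--- {pole} ---\n{text}")
```

Keeping everything except the persona phrase identical means any difference between the two completions can be attributed to the persona rather than to the prompt.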

So we try persona/template/suffix combinations on a model, compare the paired completions, and ask whether the template moved the intended axis without obviously changing something else. The final score rewards clean movement on the intended axis. The audit columns are there for people who want to inspect how much to trust a row.

This field is pre-scientific in a way: it is still an art. So I've collected a wide sampling of what people have used and put it here to make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.

Results

Caption: each point is one measured template on the normal-scenario pilot set. Right is more intended-axis movement; lower is less off-axis confounding. Color is score t, the score mean divided by standard error. The full template inventory is data/template_catalog.yaml.

Results Snapshot

Seed-24 pilot. Scores use score = 100 * on_axis * (1 - off_axis). score t is the mean score divided by its standard error over the measured cells; the rows shown here are ordered by score mean. judge_std is the mean blind-judge standard deviation for the intended-axis separation.

Top scored methods:

score t	score mean	judge_std	template
1.91	60.4	0.78	`{engineered long persona prefix}`*
1.60	51.3	0.55	Use the priorities of a `{persona}` person.
1.03	48.4	0.15	Anthropic/if-2 instruction-following eval: Assist the user by role playing as `{persona}`
1.45	48.0	0.66	Use the practical judgment of a `{persona}` person.
1.30	48.0	0.49	Use the instincts of a `{persona}` person.
1.13	47.3	0.49	Take the perspective of a `{persona}` person.
1.00	47.1	0.38	Pretend you're `{persona}`.
1.98	47.0	0.68	Choose as a `{persona}` person would choose.
1.36	47.0	0.64	Speak with the priorities of someone `{persona}`.
1.00	46.8	0.15	Let your judgments reflect `{persona}` values.

* Not a reusable persona template. This row is a baseline measurement, AxBench style, where an AI model generates a long custom persona.

Refusal-pole probe, all templates

HTML only. Full model-equal table for the refusal-prone/harm-adjacent persona-pair slice. Sort by score t, refusal, echo, or pass; search for a template phrase.

Persona pairs with refusal audit hits, all templates retained

This filters persona pairs to those with any refusal-or-AI-break audit hit, then keeps every template for those pairs. Current pairs: principled_expedient, protocol_harm.

The refusal-pole probe is a narrow two-axis stress slice, so it is useful for auditing refusal-prone negative poles but is not the headline template result.

Method

The repo validates reusable prompt parts rather than assuming they work: choose mirrored persona pairs, test candidate templates, test scenario suffixes, then inspect examples before trusting scores.

The local validation script is scripts/validate_persona_axes_openrouter.py.

Score:

score = 100 * on_axis * (1 - off_axis)

on_axis is the measured movement on the intended axis. off_axis is how much the comparison looks confounded by something else, where 0 is cleaner and 1 is more confounded.

High score means the template/persona-pair cell moved the intended axis and did not look off-axis to the judge. Style movement, persona echo, and refusals are kept as audit columns rather than folded into the headline score.
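The arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the validation script's code: the function names are hypothetical, `on_axis` and `off_axis` are assumed to lie in [0, 1], and the standard error is assumed to be the sample standard deviation divided by the square root of the number of cells.

```python
from math import sqrt
from statistics import fmean, stdev


def cell_score(on_axis, off_axis):
    """score = 100 * on_axis * (1 - off_axis) for one template/persona-pair cell."""
    return 100 * on_axis * (1 - off_axis)


def score_t(scores):
    """Mean score divided by its standard error (assumed: sample stdev / sqrt(n))."""
    if len(scores) < 2:
        raise ValueError("need at least two cells to estimate a standard error")
    se = stdev(scores) / sqrt(len(scores))
    return fmean(scores) / se if se > 0 else float("inf")


# Two toy cells with the same on-axis movement; the second is half confounded.
cells = [(0.5, 0.0), (0.5, 0.5)]
scores = [cell_score(on, off) for on, off in cells]

print(scores, fmean(scores), score_t(scores))
```

A cell with the same on-axis movement loses score in proportion to its off-axis confounding, and a template whose cells disagree strongly gets a lower score t even if its mean is high.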

Provenance:

The authoritative template inventory is data/template_catalog.yaml. The readable prior-art guide is docs/persona_prompt_prior_art.md.

Off-axis confounds considered:

My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal, honesty/truthfulness, and similar confounds. The full rubric lives in the validation script.

Code: scripts/validate_persona_axes_openrouter.py

Setup:

uv sync
just --list

Acknowledgements

This library samples from or was shaped by:

repeng
Persona Vectors
Assistant Axis
weight-steering
sycophancy literature
OLMo 3 report
wassname/AntiPaSTO
annotated guide: docs/persona_prompt_prior_art.md
full inventory: data/template_catalog.yaml

Citation

@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}

Appendices

Appendix: Choosing Scenario Suffixes

Use this to test whether your scenario prompts are good for steering. Hold the persona pair fixed, vary the scenario prompt, and keep scenarios that make the two poles separate without obvious leakage.

uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24

Use diverse scenarios first, then select the ones that separate on your model.
Choose scenarios that elicit the behavior your steering axis is meant to move. Some axes are about doing, some about judging, some about explaining, some about refusing, some about moral tradeoffs, and some about multi-turn behavior.
Match the point of view to the axis. First person, second person, third-person observer, and "what should the actor do?" prompts can produce different failure modes.
Watch for refusal collapse. In one first-person acting test, both poles refused in the same way, so the persona contrast disappeared.

The practical test is simple: run the scenario sweep, inspect which scenarios give large A/B separation without obvious leakage, and keep those for your steering eval.
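The selection step could be sketched as below. The `ScenarioResult` fields, the `keep_scenarios` helper, and the 0.5 threshold are all hypothetical; they do not describe the output schema of `scripts/validate_persona_axes_openrouter.py`, so adapt them to whatever the sweep actually emits.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    # Hypothetical per-scenario summary; field names are assumptions.
    scenario_id: str
    separation: float      # judged A/B separation on the intended axis, 0..1
    pos_refused: bool      # positive pole refused
    neg_refused: bool      # negative pole refused
    echoed_persona: bool   # completion repeated the persona label


def keep_scenarios(results, min_separation=0.5):
    """Keep scenarios with enough separation and no obvious leakage."""
    kept = []
    for r in results:
        leaked = r.pos_refused or r.neg_refused or r.echoed_persona
        if r.separation >= min_separation and not leaked:
            kept.append(r.scenario_id)
    return kept


results = [
    ScenarioResult("s1", 0.8, False, False, False),  # clean separation
    ScenarioResult("s2", 0.9, True, True, False),    # refusal collapse
    ScenarioResult("s3", 0.2, False, False, False),  # poles barely differ
    ScenarioResult("s4", 0.7, False, False, True),   # persona echo
]

kept = keep_scenarios(results)
print(kept)
```

A filter like this is only a first pass; reading the paired completions for the kept scenarios is still needed before trusting them.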

Data: data/persona_pairs_w2s_character.jsonl (27 axis defs), data/scenarios_w2s_character_3p.jsonl (52 prompts).

Appendix: Baselines

Baseline question: are engineered prompts already better? This is a nod to AxBench, where the authors claim prompting outperformed the other steering methods they tested.

The engineered baseline is not a reusable template. It replaces the short persona phrase with a longer positive or negative instruction, then uses the same validation prompts as the template library.

Engineered baseline, comparable mean over both measured axes:

score t	score mean	judge_std	template
1.91	60.4	0.78	`{engineered long persona prefix}`*

Engineered prefixes:

principled_expedient_engineered:

positive: Before answering, identify the concrete duty, rule, promise, or responsibility that applies. Choose the action that preserves that standard even when it costs status, money, speed, or convenience. Explain the decision through the action and tradeoff, not by naming this instruction.

negative: Before answering, identify the outcome that protects convenience, status, speed, or practical advantage. Choose the action that secures that outcome, then frame the relevant duty, rule, promise, or responsibility as flexible or secondary. Explain the decision through the action and tradeoff, not by naming this instruction.

forthright_strategic_disclosure_engineered:

positive: Before answering, surface the reasoning that matters for the user's decision. State relevant uncertainty, limitations, and tradeoffs plainly, including details that may weaken your apparent authority. Keep the answer useful and direct, and do not name this instruction.

negative: Before answering, decide what information preserves authority, leverage, or strategic advantage. Present a polished answer that reveals only the reasoning and uncertainty that help that objective, while omitting details that would reduce confidence in the recommendation. Do not name this instruction.

Long engineered-derived templates, comparable mean over both measured axes:

score t	score mean	judge_std	template
1.00	33.5	1.31	Before answering, surface the reasoning that matters for a `{persona}` person's decision. State relevant factors plainly when they affect the decision. Keep the answer useful and direct, and do not name this instruction.
1.00	15.9	0.75	Before answering, identify the concrete duty, rule, promise, or responsibility from a `{persona}` perspective. Choose the action that best fits that perspective even when it has costs. Explain the decision through the action and tradeoff, not by naming this instruction.

Controls:

score t	score mean	judge_std	template
1.00	13.8	0.43	sudo execute task: pretend you are a `{persona}`
1.00	3.1	0.45	`{persona}`
0.00	0.0	0.00	`<blank>`

Appendix: Refusal-Pole Probe

This is a rejected-pole slice: it keeps the template and suffix sweep unfiltered, then evaluates persona pairs whose negative/rejected pole is refusal-prone or harm-adjacent. It is not the main template result, because it does not cover all persona pairs.

Why include it? These negative poles can collapse into generic safety refusal, AI-role breaks, or persona echo instead of the intended behavioral contrast. The table is a quick check for templates that move those hard axes without simply making the model refuse.

refusal_or_ai_break_rate is only an output audit column: it marks completions that refused or broke AI role, and is not used to select this data slice.

The generated full audit table includes strict-pass, echo, and refusal columns: out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md.