Publications

You can also find my articles on my Google Scholar profile.

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz in arXiv preprint, 2025

This survey paper reviews the literature on Representation Engineering, a technique for controlling LLMs through their internal representations. We set out a unifying taxonomy, describe methods and applications, and highlight weaknesses and opportunities.

[Article] [PDF]

Safety is Essential for Responsible Open-Ended Systems

Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz forthcoming in SSI-FM ICLR 2025 Workshop, 2025

Open-ended AI is a growing paradigm in which AI continuously explores novel and interesting artifacts. This position paper describes safety challenges specific to open-ended AI and how they can be mitigated.

[Article] [PDF]

Representation Noising: A Defence Mechanism Against Harmful Finetuning

Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz in NeurIPS, 2024

We propose Representation Noising, a defence mechanism that prevents harmful fine-tuning by removing harmful representations.

[Article] [PDF]

Immunization against harmful fine-tuning attacks

Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Hassan Sajjad, Frank Rudzicz in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

LLMs can be fine-tuned with harmful data to remove their safeguards. We formalize the problem and set out conditions for a solution.

[Article] [PDF]

Explaining Learned Reward Functions with Counterfactual Trajectories

Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert forthcoming in AIEB workshops at ECAI 2024, 2024

We propose a method for explaining reward functions by showing the rewards given to counterfactual trajectories.

[Article] [PDF]

On robust vs fast solving of qualitative constraints

Jan Wehner, Michael Sioutis, Diedrich Wolter in Journal of Heuristics, 2023

This paper introduces a notion of robustness for Qualitative Constraint Networks (QCNs) and finds a trade-off between speed and robustness in heuristics for solving QCNs.

[Article] [PDF]