Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Posts

Should the AI Safety Community Prioritize Safety Cases?

26 minute read

I recently wrote an Introduction to AI Safety Cases. It left me wondering whether they are actually an impactful intervention that should be prioritized by the AI Safety Community.

Safety Cases Explained: How to Argue an AI is Safe

16 minute read

Safety Cases are a promising approach in AI Governance inspired by other safety-critical industries. They are structured arguments, based on evidence, that a system is safe in a specific context. I will introduce what Safety Cases are, how they can be used, and what work is currently being done on them. This explainer leans on Buhl et al. 2024. At the end, I survey expert opinions on the promise and weaknesses of Safety Cases.

A Call for Better Risk Modelling

8 minute read

TL;DR: The EU’s Code of Practice (CoP) requires AI companies to conduct state-of-the-art Risk Modelling. However, the current SoTA has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage more people in AI Safety to work on it. Work on Risk Modelling is urgent because the CoP is to be enforced starting in 9 months (Aug 2, 2026).

Learning from the Luddites: Implications for a modern AI labour movement

14 minute read

The Luddites were a social movement of English textile workers in the 19th century, famous for smashing the machines that were replacing their jobs. The term Luddite is now used to describe opponents of new technologies (often in a derogatory way). However, I believe many people using the term misunderstand what the Luddites did and wanted. Indeed, the Luddites and their ultimate failure can teach a modern AI-labour movement valuable lessons.

30.9.2045

14 minute read

The sleep pod opens with a soft hiss. Five hours—that’s all I need anymore. The pod regulated everything through the night: temperature shifting through optimal sleep cycles, gentle massages during deep sleep, binaural tones guiding my brain through REM. I step out feeling completely rested.

Envision paradise in the face of catastrophe

7 minute read

You might think a horrible catastrophe is imminent due to climate change, plummeting birth rates, democratic backsliding or whatever your doom-of-choice is. In the face of such catastrophes it might seem unimportant, distracting or downright offensive to consider what beautiful futures might look like. However, I believe that thinking about the end goal is essential for steering towards positive outcomes.

15 Levers for influencing Frontier AI Companies

17 minute read

The development of AGI could be the most important event of our lifetimes. Ensuring that AI is developed and deployed safely could be the most impactful thing many of us can work on. However, the development of Frontier AI systems is happening in only a handful of companies. This leaves the rest of us to wonder: How can we influence Frontier AI Companies?

My Predictions for AI 2027: Metaculus Forecasting Questions

26 minute read

The AI 2027 Tournament on Metaculus poses 16 questions about the near-term future of AI. They are derived from the AI 2027 scenario and cover predictions about technological, economic, political and societal developments. I made predictions for 6 of the questions and want to share my reasoning. I believe there is virtue in making public predictions, opening my reasoning up to criticism and contributing to the discourse.

Open Challenges in Representation Engineering

8 minute read

This post summarizes the taxonomy, challenges, and opportunities from a survey paper on Representation Engineering that I’ve written with Sahar Abdelnabi, David Krueger, and Mario Fritz. Cross-posted from the AI Alignment Forum.

portfolio

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

publications

On robust vs fast solving of qualitative constraints

Jan Wehner, Michael Sioutis, Diedrich Wolter in Journal of Heuristics, 2023

This paper introduces the notion of Robustness to Qualitative Constraint Networks (QCNs) and finds a tradeoff between speed and robustness in heuristics for solving QCNs.

[Article] [PDF]

Explaining Learned Reward Functions with Counterfactual Trajectories

Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert forthcoming in AIEB workshops at ECAI 2024, 2024

We propose a method for explaining reward functions by showing the rewards given to counterfactual trajectories.

[Article] [PDF]

Immunization against harmful fine-tuning attacks

Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Hassan Sajjad, Frank Rudzicz in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

LLMs can be fine-tuned with harmful data to remove their safeguards. We formalize the problem and set out conditions for a solution.

[Article] [PDF]

Representation Noising: A Defence Mechanism Against Harmful Finetuning

Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz in NeurIPS, 2024

We propose Representation Noising, which prevents harmful fine-tuning by removing harmful representations.

[Article] [PDF]

Safety is Essential for Responsible Open-Ended Systems

Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz forthcoming in SSI-FM ICLR 2025 Workshop, 2025

Open-ended AI is a growing paradigm where AI continuously explores novel and interesting artifacts. This position paper describes specific safety challenges in Open-Ended AI and how they can be mitigated.

[Article] [PDF]

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz in arXiv preprint, 2025

This survey paper reviews the literature on Representation Engineering, a technique for controlling LLMs through their internal representations. We set out a unifying taxonomy, describe methods and applications and showcase weaknesses and opportunities.

[Article] [PDF]

teaching

Teaching experience 1

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

This is a description of a teaching experience. You can use markdown like any other post.

Jan Wehner
