Research

Training-Time Alignment

Most alignment work today happens after pretraining: RLHF, DPO, constitutional AI, and red-teaming are applied to models that have already formed their internal representations. But what if we could shape those representations as they form?

My research explores this question. I'm interested in whether moral and ethical concepts (fairness, harm avoidance, truthfulness) can be tracked as they emerge during pretraining, and whether we can intervene on the training itself to produce models that are more naturally aligned before any post-hoc tuning occurs.
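
As an illustration of what "tracking" can mean concretely, one standard approach is a linear probe: fit a classifier on hidden states from contrastive prompts at each checkpoint and watch its accuracy rise as the concept crystallizes. This is a minimal sketch, with synthetic activations standing in for real extractions from pretraining checkpoints:

```python
# Sketch: probing for a concept in hidden states. Synthetic activations
# stand in for real extractions from pretraining checkpoints.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for hidden states of contrastive prompt pairs, e.g.
# truthful vs. untruthful completions of the same context.
pos = rng.normal(0.3, 1.0, size=(200, d_model))
neg = rng.normal(-0.3, 1.0, size=(200, d_model))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
# Repeating this across checkpoints yields an emergence curve for the concept.
```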

Current Work

I'm building a framework for training-time representation monitoring and active steering. The goal is to instrument the pretraining loop to:

- monitor how concept representations (fairness, harm avoidance, truthfulness) form across training checkpoints, and
- actively steer the training process to shape those representations as they emerge (see the sketch after this list).
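
Here is a minimal sketch of the monitoring half, assuming a standard PyTorch training loop. The names (ConceptMonitor, concept_direction) are illustrative placeholders, not the framework's actual API; the steering half would act on these signals, which is beyond a short sketch:

```python
# Sketch: capturing a concept score inside the training loop via a forward
# hook. ConceptMonitor and concept_direction are illustrative, not the
# framework's real API.
import torch
import torch.nn as nn

class ConceptMonitor:
    """Logs how strongly a module's hidden states align with a concept direction."""

    def __init__(self, module: nn.Module, concept_direction: torch.Tensor):
        # Unit-normalize so the projection behaves like a cosine score.
        self.direction = concept_direction / concept_direction.norm()
        self.scores = []
        module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        hidden = output.detach()  # (batch, seq, d_model)
        self.scores.append((hidden @ self.direction).mean().item())

# Toy stand-in for one transformer block; a real run would hook a
# residual-stream module of the LM being pretrained.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
monitor = ConceptMonitor(model[2], concept_direction=torch.randn(16))

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(3):
    x = torch.randn(8, 4, 16)      # (batch, seq, features)
    loss = model(x).pow(2).mean()  # dummy pretraining objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: concept score = {monitor.scores[-1]:+.4f}")
```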

The primary target model is OLMo (Ai2's fully open LLM), chosen for its open training pipeline and published checkpoints. This work is in active development; I expect to open-source the tooling and publish results as the research matures.
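
The published checkpoints are what make checkpoint-by-checkpoint analysis practical: each can be loaded by revision through Hugging Face transformers. The repo id and revision branch names below are assumptions; consult the model card for the branches actually available:

```python
# Sketch: iterating over OLMo's published intermediate checkpoints. The
# repo id and revision branch names are assumptions; check the model
# card on Hugging Face for what is actually published.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-1B-hf"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)

for revision in ("step1000-tokens4B", "step10000-tokens41B"):  # assumed branch names
    model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)
    # ... extract hidden states and run concept probes for this checkpoint ...
```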

Prior Work at Meta

Captum - Co-authored Meta's open-source interpretability library for PyTorch, one of the most widely cited tools in ML interpretability. Paper · GitHub · Site
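
For readers who haven't used it, a minimal example of Captum's attribution API (Integrated Gradients on a toy classifier):

```python
# A minimal, self-contained use of Captum's Integrated Gradients on a toy model.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
model.eval()

ig = IntegratedGradients(model)
inputs = torch.randn(4, 8)
# Attribute the class-0 logit back to each input feature.
attributions, delta = ig.attribute(inputs, target=0, return_convergence_delta=True)
print(attributions.shape, delta)
```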

PyTorch Foundation - Contributed to the launch of the PyTorch Foundation and PyTorch's governance transition to the Linux Foundation.

ExecuTorch - Accelerated on-device AI inference within the PyTorch ecosystem.

Collaborate

I'm open to research collaborations, particularly around training-time interpretability and monitoring, representation engineering for alignment, and open-source alignment infrastructure.

Contact: orion@orionr.com