Research

Training-Time Alignment

Most alignment work today happens after pretraining: RLHF, DPO, constitutional AI, and red-teaming are applied to models that have already formed their internal representations. But what if we could shape those representations as they form?

My research explores this question. I'm interested in whether moral and ethical concepts (fairness, harm avoidance, truthfulness) can be tracked as they emerge during pretraining, and whether we can intervene on the training itself to produce models that are more naturally aligned before any post-hoc tuning occurs.

Current Work: DeepSteer

DeepSteer is a PyTorch-native toolkit for training-time interpretability and active steering of alignment representations in LLMs. It instruments the pretraining loop to observe how these representations form and to steer them while training is still in progress.

DeepSteer supports representational probes (layer-wise moral probing, causal tracing, checkpoint trajectory analysis), behavioral benchmarks, and hook-based PEFT-compatible training-time steering. The primary target model is OLMo (Ai2's fully open LLM), chosen for its open training pipeline and published checkpoints.
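To make the idea of layer-wise probing concrete, here is a minimal sketch in plain PyTorch. It is not DeepSteer's API: the toy model, the synthetic labels, and the helper names `capture_activations` and `fit_linear_probe` are all assumptions made for illustration. It captures activations with forward hooks and fits a small logistic-regression probe per layer to check how linearly decodable a concept is at each depth.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer stack (hypothetical; not OLMo).
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 8),
)

def capture_activations(model, inputs, layer_names):
    """Run one forward pass and return {layer_name: output tensor}."""
    captured, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        # Bind `name` as a default argument so each hook keeps its own key.
        def hook(_module, _inputs, output, name=name):
            captured[name] = output.detach()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always remove hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return captured

def fit_linear_probe(features, labels, steps=200, lr=0.1):
    """Fit a logistic-regression probe and return its training accuracy."""
    probe = nn.Linear(features.shape[1], 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(probe(features).squeeze(-1), labels)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        predictions = (probe(features).squeeze(-1) > 0).float()
    return (predictions == labels).float().mean().item()

# Synthetic "concept": the label depends only on the first input feature.
x = torch.randn(256, 16)
y = (x[:, 0] > 0).float()

layer_names = ["0", "2", "4"]  # the three Linear layers in the Sequential
acts = capture_activations(model, x, layer_names)
probe_accuracy = {name: fit_linear_probe(acts[name], y) for name in layer_names}

for name, accuracy in probe_accuracy.items():
    print(f"layer {name}: probe training accuracy = {accuracy:.2f}")
```

Run against a series of saved checkpoints rather than a single model, the same loop would give a rough picture of when a concept becomes linearly decodable at each layer. A real study would score the probe on held-out data instead of training accuracy.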

GitHub

Prior Work at Meta

Captum - Co-authored Meta's open-source interpretability library for PyTorch, one of the most widely cited tools in the ML interpretability space. Paper GitHub Site

PyTorch Foundation - Contributed to the launch and governance transition of PyTorch to the Linux Foundation.

ExecuTorch - Contributed to PyTorch's on-device inference framework, deploying models across hardware from microcontrollers to mobile SoCs while preserving PyTorch semantics. Paper GitHub

Collaborate

I'm open to research collaborations, particularly around training-time interpretability and monitoring, representation engineering for alignment, and open-source alignment infrastructure.

Contact: orion@orionr.com