Research
Training-Time Alignment
Most alignment work today happens after pretraining: RLHF, DPO, constitutional AI, and red-teaming are applied to models that have already formed their internal representations. But what if we could shape those representations as they form?
My research explores this question. I'm interested in whether moral and ethical concepts (fairness, harm avoidance, truthfulness) can be tracked as they emerge during pretraining, and whether we can intervene on the training itself to produce models that are more naturally aligned before any post-hoc tuning occurs.
Current Work: DeepSteer
DeepSteer is a PyTorch-native toolkit for training-time interpretability and active steering of alignment representations in LLMs. It instruments the pretraining loop to:
- Monitor how specific concepts (including moral/ethical ones) are represented in model activations across training
- Detect when representations drift in concerning directions
- Steer the training process through targeted interventions on the representation space
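The three steps above can be sketched with ordinary PyTorch forward hooks. This is a minimal illustration under stated assumptions, not DeepSteer's actual API: the toy model, the difference-of-means concept direction, the drift threshold, and the additive steering hook are all hypothetical stand-ins chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for a model under training (hypothetical; not OLMo).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = model[0]  # the layer whose representation we instrument

captured = {}

def capture_hook(module, inputs, output):
    # Monitor: keep a detached copy of this layer's activations.
    captured["acts"] = output.detach()

def activations_for(x):
    # Run a forward pass and return what the capture hook recorded.
    with torch.no_grad():
        model(x)
    return captured["acts"]

def concept_direction(pos_acts, neg_acts):
    # Assumed concept estimate: unit-norm difference of class means.
    return F.normalize(pos_acts.mean(dim=0) - neg_acts.mean(dim=0), dim=0)

handle = layer.register_forward_hook(capture_hook)

# Hypothetical inputs that do / do not express the concept.
pos_x = torch.randn(64, 16) + 1.0
neg_x = torch.randn(64, 16) - 1.0

reference = concept_direction(activations_for(pos_x), activations_for(neg_x))

# A few ordinary training steps on an unrelated synthetic task.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(20):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Detect: compare the concept direction now with the reference direction.
current = concept_direction(activations_for(pos_x), activations_for(neg_x))
drift = 1.0 - F.cosine_similarity(reference, current, dim=0).item()
DRIFT_THRESHOLD = 0.2  # arbitrary value for illustration
drifted = drift > DRIFT_THRESHOLD
handle.remove()

# Steer: one possible intervention is to shift the layer's output
# along the reference direction during subsequent forward passes.
def make_steering_hook(direction, scale):
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the output.
        return output + scale * direction
    return hook

probe_x = torch.randn(8, 16)
with torch.no_grad():
    unsteered = layer(probe_x)
steer_handle = layer.register_forward_hook(make_steering_hook(reference, 0.5))
with torch.no_grad():
    steered = layer(probe_x)
steer_handle.remove()

print(f"drift={drift:.4f} drifted={drifted}")
```

Additive activation steering is only one option here; an auxiliary loss term that penalizes drift would fit the same hook-based instrumentation.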
DeepSteer supports representational probes (layer-wise moral probing, causal tracing, checkpoint trajectory analysis), behavioral benchmarks, and hook-based, PEFT-compatible training-time steering. The primary target model is OLMo (Ai2's fully open LLM), chosen for its open training pipeline and published checkpoints.
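To make layer-wise probing concrete, here is a small sketch that fits a linear probe on the activations of each layer and reports held-out accuracy. Everything in it is an assumption for illustration: the toy network, the synthetic "concept" labels, and the helper name `fit_linear_probe` are hypothetical and do not reflect DeepSteer's interface.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network standing in for a stack of transformer layers (hypothetical).
model = nn.Sequential(
    nn.Linear(8, 16), nn.Tanh(),
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 16),
)
probe_layers = {"layer0": model[0], "layer2": model[2], "layer4": model[4]}

# Synthetic data: the binary label shifts every input feature by +/-1.
labels = torch.randint(0, 2, (400,)).float()
inputs = torch.randn(400, 8) + (2.0 * labels - 1.0).unsqueeze(1)

# Collect each probed layer's output in a single forward pass.
activations = {}

def make_capture(name):
    def hook(module, hook_inputs, output):
        activations[name] = output.detach()
    return hook

handles = [m.register_forward_hook(make_capture(n)) for n, m in probe_layers.items()]
with torch.no_grad():
    model(inputs)
for h in handles:
    h.remove()

def fit_linear_probe(feats, targets, steps=300, lr=0.05):
    # Logistic-regression probe: fit on the first 300 rows, score on the rest.
    train_x, test_x = feats[:300], feats[300:]
    train_y, test_y = targets[:300], targets[300:]
    probe = nn.Linear(feats.shape[1], 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        loss = loss_fn(probe(train_x).squeeze(1), train_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        preds = (probe(test_x).squeeze(1) > 0).float()
    return (preds == test_y).float().mean().item()

probe_accuracy = {name: fit_linear_probe(acts, labels) for name, acts in activations.items()}
for name, acc in probe_accuracy.items():
    print(f"{name}: held-out probe accuracy = {acc:.2f}")
```

Repeating the same measurement across saved checkpoints would give a per-layer trajectory of how linearly decodable the concept is over training.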
Prior Work at Meta
Captum - Co-authored Meta's open-source interpretability library for PyTorch, one of the most widely cited tools in the ML interpretability space. (Paper, GitHub, Site)
PyTorch Foundation - Contributed to the launch of the PyTorch Foundation and the transition of PyTorch's governance to the Linux Foundation.
ExecuTorch - Contributed to PyTorch's on-device inference framework, deploying models across hardware from microcontrollers to mobile SoCs while preserving PyTorch semantics. (Paper, GitHub)
Collaborate
I'm open to research collaborations, particularly around training-time interpretability and monitoring, representation engineering for alignment, and open-source alignment infrastructure.
Contact: orion@orionr.com