Hey! I’m in my final year as a Machine Learning D.Phil student in the OATML group at the University of Oxford. I’m co-supervised by Prof. Yarin Gal, Prof. Tim Rudner, and Dr. Adrien Gaidon.
I’m currently working on designing methods, architectures, and benchmarks that enable embodied agents to learn long-horizon tasks by creating and accessing episodic memories. I focus specifically on transformer-based agents trained with large-scale RL. Some of the topics I have worked on over the past two years are:
I also collaborate closely with researchers from Toyota Research on topics related to robot learning.
Prior to starting my D.Phil, I worked as a deep learning researcher at Wayve, a London-based startup applying end-to-end deep learning to autonomous driving. Before that, I graduated from the Machine Learning Research Master’s at Mila (Sept 2020), where I primarily did research on Bayesian deep learning, continual learning, and inverse reinforcement learning. I was also an ED&I Fellow with the MPLS Division at the University of Oxford in the 2022–2023 cohort.
I was introduced to robotics during a year-long research internship at IIIT Hyderabad, India (2017–2018), where I worked on multi-robot SLAM and view-invariant place recognition and relocalisation.
Download my résumé.
D.Phil Machine Learning (AIMS CDT), 2024
University of Oxford
Research Master's in Machine Learning, 2020
Montreal Institute for Learning Algorithms (Mila)
B.Tech in Maths and Computing (Applied Mathematics), 2016
Delhi Technological University (DTU/DCE)
Here I:
Fine-tuning vision-language foundation models has emerged as a powerful approach to leveraging internet-scale data for generalization in downstream applications. A particularly promising class of representations, already used successfully in supervised learning, can be derived from pretrained diffusion models; these representations have been shown to capture both high-level semantic information about a scene and low-level spatial information. However, this potential has not yet been realized for control-based robotics tasks, which often involve vision-based observations paired with language instructions. This paper presents Stable Control Representations, which use pretrained text-to-image diffusion models as a source of vision-language representations for downstream control policies. We show that these representations are competitive on a variety of challenging simulated control benchmarks and yield strong performance on tasks that require generalization to unseen objects at test time. Crucially, we show that they enable generalization on a challenging open-vocabulary navigation benchmark, on which they outperform all other pretrained approaches. Finally, to examine their robustness, we deconstruct diffusion model representations by ablating the design choices involved in extracting them and present insights into their failure modes.
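For a concrete sense of what "using a pretrained diffusion model as a vision-language representation extractor" can look like, here is a minimal, hypothetical sketch. It assumes Hugging Face diffusers with Stable Diffusion v1.5 as a stand-in pretrained model, and it simply hooks an intermediate U-Net activation after one text-conditioned denoising pass. The specific layer, noise timestep, and checkpoint are illustrative assumptions, not the paper's exact recipe (those are precisely the design choices the paper ablates).

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion model (assumption: SD v1.5 weights).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)
pipe.to("cpu")
vae, unet = pipe.vae, pipe.unet
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

# Capture an intermediate U-Net activation with a forward hook
# (the mid block is an arbitrary illustrative choice of layer).
features = {}
def hook(_module, _inputs, output):
    features["mid"] = output
handle = unet.mid_block.register_forward_hook(hook)

@torch.no_grad()
def extract_representation(image, instruction, t=100):
    """Return a text-conditioned diffusion feature for an (image, instruction) pair."""
    # 1. Encode the visual observation into the VAE latent space.
    latents = vae.encode(image).latent_dist.mean * vae.config.scaling_factor
    # 2. Encode the language instruction with the frozen text encoder.
    tokens = tokenizer(
        instruction, padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    text_emb = text_encoder(tokens.input_ids)[0]
    # 3. Add noise at a chosen timestep and run one denoising pass through the U-Net.
    timestep = torch.tensor([t])
    noise = torch.randn_like(latents)
    noisy_latents = pipe.scheduler.add_noise(latents, noise, timestep)
    unet(noisy_latents, timestep, encoder_hidden_states=text_emb)
    # 4. The hooked mid-block activation serves as the vision-language representation.
    return features["mid"].flatten(start_dim=1)

# Example: a dummy 512x512 RGB observation, normalized to [-1, 1].
obs = torch.rand(1, 3, 512, 512) * 2 - 1
rep = extract_representation(obs, "navigate to the red mug")
print(rep.shape)  # e.g. torch.Size([1, 81920]): 1280 channels at 8x8 resolution
handle.remove()
```

In a control setting, a feature like this would typically be pooled or projected and fed to a downstream policy network; which U-Net layer and noise timestep to read from, and how to condition on the instruction, are the kinds of extraction choices whose ablation the abstract refers to.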