Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, their deployment for reasoning in embodied scenarios is constrained …
To enable embodied agents to assist and operate effectively over extended timeframes, it is crucial to develop models capable of forming and accessing memories to remain contextualized in an environment. In the current paradigm of training …
Video Large Language Models (Video-LLMs) excel at understanding videos in-context, assuming full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, …