Projects
Visual Estimation in VLMs
We implicitly make many rapid estimates of properties like the weights and volumes of objects as we navigate the world. Deploying vision-language models (VLMs) in robotics applications requires them to have some of these same abilities. VLMs trained on text-image pairs may fall short on such tasks because they lack the grounding that comes from real interaction through embodied systems. We create a dataset of precise, hand-measured ground-truth values to benchmark a wide variety of models, including robotics foundation models. Our emphasis is on moving beyond well-studied spatial estimation abilities. We are also studying interventions, such as tool augmentation and fine-tuning, that can improve current models on these tasks.
Paper and dataset coming soon
Benchmarking Compositional Reasoning in LLMs
Can LLMs take basic tools and, following given composition rules, fashion more complex tools out of them to solve hard problems efficiently? Such scenarios occur not only in human societies but also in games such as Minecraft and Factorio. We are designing benchmarks that test for this skill. Lisp programming is promising: we start from primitive functions and compose them via recursion into higher-order functions. However, to prevent the task from degenerating into program synthesis and recall of memorized training data, the syntax must be adequately obfuscated.
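As a minimal sketch of the idea (the primitive names `zug`, `fip`, `rop`, and `krell` are invented here for illustration, not taken from the benchmark): obfuscated list primitives are composed recursively into a higher-order fold, which a model would then have to reason about compositionally rather than recall by name.

```python
# Obfuscated primitives over Lisp-style linked lists.
# All names are hypothetical stand-ins for the benchmark's obfuscated syntax.

def zug(x, xs):    # obfuscated `cons`: prepend x to list xs
    return (x, xs)

def fip(xs):       # obfuscated `car`: head of the list
    return xs[0]

def rop(xs):       # obfuscated `cdr`: tail of the list
    return xs[1]

def is_nil(xs):    # obfuscated null test
    return xs is None

def krell(f, acc, xs):
    # Higher-order fold built only from the primitives above:
    # recursively combines each element into the accumulator.
    if is_nil(xs):
        return acc
    return krell(f, f(acc, fip(xs)), rop(xs))

# A task expressed through the composed tool: sum the list (1 2 3).
xs = zug(1, zug(2, zug(3, None)))
total = krell(lambda a, b: a + b, 0, xs)  # → 6
```

With familiar names (`cons`, `foldl`) a model could answer from memory; with opaque names it must track how the primitives compose, which is the skill the benchmark aims to isolate.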