Projects
Visual Estimation in VLMs
We implicitly make many rapid estimates of properties like the weights and volumes of objects as we navigate the world. Deploying vision-language models (VLMs) in robotics applications requires them to have some of these same abilities. VLMs trained on text-image pairs may fall short on such tasks because they lack the grounding that comes from real interaction through embodied systems. We create a dataset of precise, hand-measured ground-truth values to benchmark a wide variety of models, including robotics foundation models. Our emphasis is on moving beyond well-studied spatial estimation abilities. We are also studying interventions, such as tool augmentation and fine-tuning, that can improve current models on these tasks.
Paper and dataset coming soon
Benchmarking Compositional Reasoning in LLMs
Can LLMs take basic tools and, following given composition rules, fashion more complex tools out of them to solve hard problems efficiently? Such scenarios occur not only in human societies but also in games such as Minecraft and Factorio. We are designing benchmarks that test for this skill. Lisp programming is promising: we start from primitive functions and compose them via recursion into higher-order functions. However, to prevent the task from degenerating into program synthesis and recall of memorized training data, the syntax must be adequately obfuscated.
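As a minimal sketch of the idea (the primitive names `zug`, `fip`, `rop`, and `krell` are invented here for illustration, not taken from the benchmark): obfuscated list primitives are composed recursively into a higher-order fold, which a model would then have to reason about compositionally rather than recall by name.

```python
# Obfuscated primitives over Lisp-style linked lists.
# All names are hypothetical stand-ins for the benchmark's obfuscated syntax.

def zug(x, xs):    # obfuscated `cons`: prepend x to list xs
    return (x, xs)

def fip(xs):       # obfuscated `car`: head of the list
    return xs[0]

def rop(xs):       # obfuscated `cdr`: tail of the list
    return xs[1]

def is_nil(xs):    # obfuscated null test
    return xs is None

def krell(f, acc, xs):
    # Higher-order fold built only from the primitives above:
    # recursively combines each element into the accumulator.
    if is_nil(xs):
        return acc
    return krell(f, f(acc, fip(xs)), rop(xs))

# A task expressed through the composed tool: sum the list (1 2 3).
xs = zug(1, zug(2, zug(3, None)))
total = krell(lambda a, b: a + b, 0, xs)  # → 6
```

With familiar names (`cons`, `foldl`) a model could answer from memory; with opaque names it must track how the primitives compose, which is the skill the benchmark aims to isolate.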