Publications · Amish Sethi

Retrieval-Augmented Vision-Language-Action Model

Jiani Huang*, Brandon Y. Yang*, Amish Sethi*, Yuchen Zheng, Christopher Watson, Jianing Qian, Mayur Naik, Dinesh Jayaraman

Under review at Conference on Robot Learning (CoRL) 2026 · senior thesis

RA-VLA retrieves a semantically similar demonstration from a large robot dataset, warps it to the current scene with a multimodal language model, and overlays it as guidance for a fine-tuned policy. It improves task success on the RoboLab-120 simulator and on real-world tasks without collecting any new demonstrations.

Do Diffusion Models Learn to Generalize Basic Visual Skills?

Amish Sethi, Boya Zeng, Wenhao Chai, Zhuang Liu

Under review at NeurIPS

A controlled study trains diffusion models on synthetic data that isolates size, position, and rotation, showing they interpolate well within the training distribution yet fail to extrapolate beyond it.

Delta Activations: A Representation for Finetuned Large Language Models

Zhiqiu Xu*, Amish Sethi*, Mayur Naik, Ser-Nam Lim

Under review at COLM · published at the NeurIPS 2025 ER Workshop

Delta Activations represent a finetuned model by how its internal activations shift from those of a base model. The representation clusters models by domain and enables task-based retrieval. I led the experiments and released more than 700 finetuned open-source models on Hugging Face.

arXiv · Website

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang, Amish Sethi*, Matthew Kuo*, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ziyang Li, Ser-Nam Lim, Mayur Naik

NeurIPS 2025 · Spotlight, top 3%

ESCA turns video into spatio-temporal scene graphs that give vision-language models explicit spatial context, cutting perception errors from 69 percent to 30 percent on EmbodiedBench without retraining the underlying models.

Paper · Website · Code · Google Blog

Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning

Aaditya Naik, Jason Liu, Claire Wang, Amish Sethi, Saikat Dutta, Mayur Naik, Eric Wong

ICML 2025

Dolphin pairs symbolic reasoning with neural computation on a CPU-GPU hybrid, reaching up to 62 times faster convergence than baselines across 13 benchmarks that span text, image, and video.

arXiv

CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules

Neelay Velingker, Amish Sethi*, Jason Liu*, William Dodds*, Zhiqiu Xu, Saikat Dutta, Mayur Naik, Eric Wong

ICML 2024 ES-FoMo Workshop

CLAM unifies finetuning, quantization, and pruning as weight-based adaptations that chain freely, so that chained compositions match uncompressed models while using 86 percent fewer bits.

Paper

VINE: A Foundation Model for Video Understanding

Amish Sethi*, Jiani Huang*, Matthew Kuo*, Ziyang Li, Mayank Keoliya, Neelay Velingker, Mayur Naik, Ser-Nam Lim

Foundation model behind ESCA

VINE turns video into probabilistic scene graphs of entities, attributes, and relations. Trained on more than 87,000 videos with neurosymbolic learning, it is promptable and finetunable for many downstream tasks.

Website · Code · Model · Dataset

Functional Genetic Biomarkers of Alzheimer's Disease and Gene Expression from Peripheral Blood

Andrew Ni*, Amish Sethi*, Alzheimer's Disease Neuroimaging Initiative

bioRxiv, 2021 · ISEF Finalist

Using clustering and dimensionality reduction, this work identified genes that differ in Alzheimer's disease and predicted the disease from peripheral blood gene expression with 98 percent accuracy. Cited more than 8 times.

Paper